Date: Mon, 9 Dec 2024 14:27:38 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-2-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 01/22] perf vendor events: Update Alderlake events/metrics
From: Ian Rogers <irogers@google.com>
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.27 to v1.28. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v1.28:
https://github.com/intel/perfmon/commit/801f43f22ec6bd23fbb5d18860f395d61e7f4081

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/alderlake/adl-metrics.json       | 3637 +++++++++++++----
 .../pmu-events/arch/x86/alderlake/cache.json  |  292 +-
 .../arch/x86/alderlake/floating-point.json    |   19 +-
 .../arch/x86/alderlake/frontend.json          |   19 -
 .../pmu-events/arch/x86/alderlake/memory.json |   32 +-
 .../arch/x86/alderlake/metricgroups.json      |   10 +-
 .../pmu-events/arch/x86/alderlake/other.json  |   92 +-
 .../arch/x86/alderlake/pipeline.json          |  127 +-
 .../arch/x86/alderlake/virtual-memory.json    |   33 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 10 files changed, 3361 insertions(+), 902 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
index 8fdf4a4225de..99d8e86a7ca0 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@@ -48,13 +48,6 @@
         "MetricName": "C7_Core_Residency",
         "ScaleUnit": "100%"
     },
-    {
-        "BriefDescription": "C7 residency percent per package",
-        "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC",
-        "MetricGroup": "Power",
-        "MetricName": "C7_Pkg_Residency",
-        "ScaleUnit": "100%"
-    },
     {
         "BriefDescription": "C8 residency percent per package",
         "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC",
@@ -62,13 +55,6 @@
         "MetricName": "C8_Pkg_Residency",
         "ScaleUnit": "100%"
     },
-    {
-        "BriefDescription": "C9 residency percent per package",
-        "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC",
-        "MetricGroup": "Power",
-        "MetricName": "C9_Pkg_Residency",
-        "ScaleUnit": "100%"
-    },
     {
         "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
         "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
@@ -112,56 +98,221 @@
         "MetricName": "tsx_transactional_cycles",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "Uncore frequency per die [GHZ]",
+        "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
+        "MetricGroup": "SoC",
+        "MetricName": "UNCORE_FREQ",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions",
-        "MetricExpr": "tma_core_bound",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_allocation_restriction",
-        "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.1)",
+        "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_core_bound >0.10) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_alu_op_utilization",
+        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
+        "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_assists",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
+        "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_avx_assists",
+        "MetricThreshold": "tma_avx_assists > 0.1",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls",
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
         "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "Default;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
-        "MetricThreshold": "tma_backend_bound > 0.1",
+        "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear",
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
         "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "(5 * cpu_atom@CPU_CLK_UNHALTED.CORE@ - (cpu_atom@TOPDOWN_FE_BOUND.ALL@ + cpu_atom@TOPDOWN_BE_BOUND.ALL@ + cpu_atom@TOPDOWN_RETIRING.ALL@)) / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
         "MetricGroup": "Default;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;MemoryTLB;Scaled_Slots;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slots;TopdownL1;tma_L1_group;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend",
         "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
         "MetricName": "tma_branch_detect",
-        "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))",
         "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
         "MetricName": "tma_branch_mispredicts",
-        "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_speculation > 0.15",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
@@ -170,456 +321,1919 @@
         "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
         "MetricName": "tma_branch_resteer",
-        "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_overlap",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of cycles due to backend bound stalls that are bounded by core restrictions and not attributed to an outstanding load or stores, or resource limitation",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
-        "MetricName": "tma_core_bound",
-        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_decode",
-        "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that does not require the use of microcode, classified as a fast nuke, due to memory ordering, memory disambiguation and memory renaming",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
-        "MetricName": "tma_fast_nuke",
-        "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)",
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.",
-        "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "Default;TopdownL1;tma_L1_group",
-        "MetricName": "tma_frontend_bound",
-        "MetricThreshold": "tma_frontend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1;Default",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
-        "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)",
+        "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group",
-        "MetricName": "tma_ifetch_bandwidth",
-        "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend latency restrictions due to icache misses, itlb misses, branch detection, and resteer limitations.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group",
-        "MetricName": "tma_ifetch_latency",
-        "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss",
-        "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_atom@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles",
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricGroup": "Ifetch",
-        "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles",
-        "PublicDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss. See Info.Ifetch_Bound",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricGroup": "Load_Store_Miss",
-        "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles",
-        "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricGroup": "Mem_Exec",
-        "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles",
-        "PublicDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound",
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
+        "MetricExpr": "((28 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_contested_accesses",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.ALL_BRANCHES@",
-        "MetricName": "tma_info_br_inst_mix_ipbranch",
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instruction per (near) call (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.CALL@",
-        "MetricName": "tma_info_br_inst_mix_ipcall",
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "(27 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.FAR_BRANCH@u",
-        "MetricName": "tma_info_br_inst_mix_ipfarbranch",
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_decode",
+        "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was not taken",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETIRED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)",
-        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken",
+        "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
+        "MetricName": "tma_decoder0_alone",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was taken",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.COND_TAKEN@",
-        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken",
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per retired indirect call or jump Branch Misprediction",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.INDIRECT@",
-        "MetricName": "tma_info_br_inst_mix_ipmisp_indirect",
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per retired return Branch Misprediction",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.RETURN@",
-        "MetricName": "tma_info_br_inst_mix_ipmisp_ret",
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per retired Branch Misprediction",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@",
-        "MetricName": "tma_info_br_inst_mix_ipmispredict",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Ratio of all branches which mispredict",
-        "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@BR_INST_RETIRED.ALL_BRANCHES@",
-        "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_ratio",
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
+        "MetricName": "tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Ratio between Mispredicted branches and unknown branches",
-        "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@BACLEARS.ANY@",
-        "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_unknown_branch_ratio",
+        "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
+        "MetricName": "tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (5 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Cycles Per Instruction", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", - "MetricName": "tma_info_core_cpi", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", - "MetricName": "tma_info_core_ipc", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "cpu_atom@UOPS_RETIRED.ALL@ / cpu_atom@INST_RETIRED.= ANY@", - "MetricName": "tma_info_core_upi", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS.IFETCH@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS.IFETCH@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@ / = cpu_atom@MEM_BOUND_STALLS.IFETCH@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@ / cp= u_atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_l1_bound", + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS.LOAD@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_load_bound", + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", - "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_store_bound", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory disambiguation",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_pki",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_heavy_operations",
+        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to floating point assists",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assist_pki",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_icache_misses",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory ordering",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_pki",
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_ifetch_bandwidth",
+        "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory renaming",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki",
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend latency restrictions due to icache misses, itlb misses, branch detection, and resteer limitations.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_ifetch_latency",
+        "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM",
+        "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to self-modifying code",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
-        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.4K_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block",
-        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_indirect",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
-        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_ret",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
-        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks",
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmispredict",
+        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
-        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
+        "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
+        "MetricGroup": "BrMispredicts;Metric",
+        "MetricName": "tma_info_bad_spec_spec_clears_ratio",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
-        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit",
+        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
+        "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t_utilization > 0.5 else 0)",
+        "MetricGroup": "Cor;Metric;SMT",
+        "MetricName": "tma_info_botlnk_l0_core_bound_likely",
+        "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match",
-        "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
-        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding",
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per Load",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_mix_ipload",
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Instructions per Store",
-        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_STORES@",
-        "MetricName": "tma_info_mem_mix_ipstore",
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of total non-speculative loads that perform one or more locks",
-        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_mix_load_locks_ratio",
+        "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_atom@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles",
        "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of total non-speculative loads that are splits",
-        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_mix_load_splits_ratio",
+        "BriefDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Ifetch",
+        "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss. See Info.Ifetch_Bound",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Ratio of mem load uops to all uops",
-        "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_mem_mix_memload_ratio",
+        "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Load_Store_Miss",
+        "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
-        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricName": "tma_info_serialization _%_tpause_cycles",
+        "BriefDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Mem_Exec",
+        "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Average CPU Utilization",
-        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.REF_TSC@ / TSC",
-        "MetricName": "tma_info_system_cpu_utilization",
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricName": "tma_info_br_inst_mix_ipbranch",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Fraction of cycles spent in Kernel mode",
-        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@k / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricGroup": "Summary",
-        "MetricName": "tma_info_system_kernel_utilization",
+        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.CALL@",
+        "MetricName": "tma_info_br_inst_mix_ipcall",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
-        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@CPU_CLK_UNHALTED.REF_TSC@",
-        "MetricGroup": "Power",
-        "MetricName": "tma_info_system_turbo_utilization",
+        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.FAR_BRANCH@u",
+        "MetricName": "tma_info_br_inst_mix_ipfarbranch",
        "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are FPDiv uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio",
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was not taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETIRED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are IDiv uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.COND_TAKEN@",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are microcode ops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
"BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of 
time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", + "MetricGroup": "Count;SMT", + "MetricName": "tma_info_core_core_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Ret;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@UOPS_RETIRED.ALL@ / cpu_atom@INST_RETIRED.= ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache 
speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ + = cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@) / cpu_atom@MEM_BOUND_STALLS.IFE= TCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@ / = cpu_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. 
Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS.LOAD@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ + cp= u_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@) / cpu_atom@MEM_BOUND_STALLS.LOAD@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS.LOAD@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@ / cp= u_atom@MEM_BOUND_STALLS.LOAD@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_l1_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS.LOAD@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_load_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", + "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_store_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory disambiguation", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu= _atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears 
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory ordering",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory renaming",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to self-modifying code",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.4K_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_ipload",
+        "Unit": "cpu_atom"
+    },
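As an aside for reviewers (not part of the generated JSON): the *_pki metrics above are all the same per-kilo-instruction normalization. A minimal Python sketch, with invented counter readings:

  # Hedged illustration of the "per kilo instruction" (PKI) shape used above.
  # Both counter values are hypothetical, not measured.
  machine_clears_smc = 1_200          # MACHINE_CLEARS.SMC (hypothetical)
  inst_retired_any = 3_000_000_000    # INST_RETIRED.ANY (hypothetical)
  smc_pki = 1e3 * machine_clears_smc / inst_retired_any
  print(f"machine_clears_smc_pki = {smc_pki:.6f}")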
"BriefDescription": "Instructions per Store", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", + "MetricName": "tma_info_mem_mix_ipstore", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_locks_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_splits_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of mem load uops to all uops", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@UOPS_RETIRED.ALL@", + "MetricName": "tma_info_mem_mix_memload_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + 
"MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= 
+    {
+        "BriefDescription": "Average Parallel L2 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
+        "MetricGroup": "Memory_BW;Metric;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l2_mlp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Latency for L3 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+        "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l3_miss_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
+        "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY",
+        "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat",
+        "MetricName": "tma_info_memory_load_miss_real_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "\"Bus lock\" per kilo instruction",
+        "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_bus_lock_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Un-cacheable retired load per kilo instruction",
+        "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_uc_load_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is at least one such miss)",
+        "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES",
+        "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric",
+        "MetricName": "tma_info_memory_mlp",
+        "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is at least one such miss. Per-Logical Processor)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Metric;Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Fed;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_code_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_load_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Metric;Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_page_walks_utilization",
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5",
+        "Unit": "cpu_atom"
+    },
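To illustrate tma_info_memory_mlp above (not part of the generated JSON): it is the average number of in-flight L1D-miss demand loads per cycle in which at least one such miss is outstanding. A sketch with hypothetical counter values:

  # Hedged sketch of the MLP ratio; both inputs are invented for the example.
  l1d_pend_miss_pending = 8_000_000_000   # L1D_PEND_MISS.PENDING (hypothetical)
  l1d_pend_miss_cycles = 2_000_000_000    # L1D_PEND_MISS.PENDING_CYCLES (hypothetical)
  mlp = l1d_pend_miss_pending / l1d_pend_miss_cycles
  print(f"MLP ~= {mlp:.2f} misses in flight per miss-cycle")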
+    {
+        "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_store_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
+        "MetricGroup": "Cor;Metric;Pipeline;PortsUtil;SMT",
+        "MetricName": "tma_info_pipeline_execute",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from DSB per cycle",
+        "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_dsb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from LSD per cycle",
+        "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_lsd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from MITE per cycle",
+        "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per a microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
+        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricName": "tma_info_serialization _%_tpause_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Metric",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary;System_Metric",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_atom"
+    },
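The three tma_info_pipeline_fetch_* metrics above share one pattern: uops delivered by a frontend source divided by the cycles that source was active. A hedged sketch (hypothetical inputs, not part of the generated JSON):

  # Per-source fetch bandwidth, DSB shown; LSD and MITE follow the same shape.
  idq_dsb_uops = 9_000_000_000        # IDQ.DSB_UOPS (hypothetical)
  idq_dsb_cycles_any = 2_000_000_000  # IDQ.DSB_CYCLES_ANY (hypothetical)
  dsb_uops_per_cycle = idq_dsb_uops / idq_dsb_cycles_any
  print(f"DSB delivered ~{dsb_uops_per_cycle:.2f} uops/cycle while active")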
"tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Metric;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Metric;Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", + "MetricGroup": "GB/sec;HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_system_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Cor;Flops;HPC;Metric", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;Inst_Metric;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "Metric;OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Clocks;Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC;System_Metric", + "MetricName": "tma_info_system_power", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "Core_Metric;SMT", + "MetricName": "tma_info_system_smt_2t_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": 
"Socket actual clocks when any core is active = on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "Count;SoC", + "MetricName": "tma_info_system_socket_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Seconds;Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Count;Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Metric;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Metric;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Metric;Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricGroup": "Metric;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots_utilization", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Metric;Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Metric", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@UOPS_= RETIRED.ALL@", + "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@UOPS_R= ETIRED.ALL@", + "MetricName": 
"tma_info_uop_mix_idiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are microcode op= s", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@UOPS_RET= IRED.ALL@", + "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@UOPS_RE= TIRED.ALL@", + "MetricName": "tma_info_uop_mix_x87_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_ope= rations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(12 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group;tma_overlap",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
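The tma_load_stlb_miss_{4k,2m,1g} children above split the parent walk-active cost in proportion to completed walks per page size. A hedged sketch (hypothetical inputs, not part of the generated JSON):

  # DTLB_LOAD_MISSES.WALK_COMPLETED_* readings and the parent value are invented.
  walks_4k, walks_2m_4m, walks_1g = 900_000, 80_000, 20_000
  tma_load_stlb_miss = 0.06   # parent node value (hypothetical)
  total = walks_4k + walks_2m_4m + walks_1g
  for name, walks in (("4k", walks_4k), ("2m", walks_2m_4m), ("1g", walks_1g)):
      print(f"tma_load_stlb_miss_{name} ~= {tma_load_stlb_miss * walks / total:.4f}")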
operations", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", + "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma= _issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", + "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
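On the bandwidth/latency split here (illustration only, not part of the generated JSON): tma_mem_bandwidth counts cycles with at least 4 outstanding offcore data reads (the cmask=0x4 qualifier), while tma_mem_latency, defined next, takes cycles with any outstanding read and subtracts the bandwidth part. A hedged sketch with invented cycle counts:

  # Inputs are hypothetical; min() mirrors the clamping in the expressions above.
  clks = 2_400_000_000          # tma_info_thread_clks (hypothetical)
  cycles_ge_1_rd = 600_000_000  # OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD (hypothetical)
  cycles_ge_4_rd = 250_000_000  # same event qualified with cmask=0x4 (hypothetical)
  tma_mem_bandwidth = min(clks, cycles_ge_4_rd) / clks
  tma_mem_latency = min(clks, cycles_ge_1_rd) / clks - tma_mem_bandwidth
  print(f"mem_bandwidth={tma_mem_bandwidth:.3f} mem_latency={tma_mem_latency:.3f}")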
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_mem_scheduler",
+        "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations, uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are x87 uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
-        "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
-        "MetricName": "tma_machine_clears",
-        "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued, the Count Domain; [ADL+] cycles)",
+        "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
+        "MetricName": "tma_mixing_vectors",
+        "MetricThreshold": "tma_mixing_vectors > 0.05",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued, the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_mem_scheduler",
-        "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks / 2",
+        "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
+        "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=0x1\\,edge\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
+        "MetricName": "tma_ms_switches",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused",
+        "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_non_fused_branches",
+        "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
@@ -628,82 +2242,389 @@
         "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
         "MetricName": "tma_non_mem_scheduler",
-        "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_nuke",
+        "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05) & ((tma_bad_speculation >0.15)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_other_fb",
+        "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+        "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_other_light_ops",
+        "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks",
+        "MetricGroup": "Compute;Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+        "MetricName": "tma_port_0",
+        "MetricThreshold": "tma_port_0 > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks",
+        "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+        "MetricName": "tma_port_1",
+        "MetricThreshold": "tma_port_1 > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
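
The per-port nodes above are plain ratios of dispatched uops to core clocks. A back-of-envelope reading with example counter values (illustrative numbers, not measured data):

  uops_port_0 = 3.2e9   # UOPS_DISPATCHED.PORT_0
  uops_port_1 = 2.9e9   # UOPS_DISPATCHED.PORT_1
  core_clks = 5.0e9     # tma_info_core_core_clks

  tma_port_0 = uops_port_0 / core_clks  # fraction of core cycles port 0 was busy
  tma_port_1 = uops_port_1 / core_clks
  print(f"tma_port_0 = {tma_port_0:.0%}, tma_port_1 = {tma_port_1:.0%}")
  # Either value exceeding its 0.6 MetricThreshold would flag the node.
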
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU)",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks",
+        "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+        "MetricName": "tma_port_6",
+        "MetricThreshold": "tma_port_6 > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESOURCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
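
Taken together, the tma_ports_utilized_{0,1,2,3m} nodes form a cycles histogram: in how many cycles did 0, 1, 2, or 3+ uops execute. A sketch with made-up counter values (illustrative only; the real 0-bucket expression above is more involved):

  clks = 4.0e9  # tma_info_thread_clks
  buckets = {
      "1": 0.8e9,   # EXE_ACTIVITY.1_PORTS_UTIL
      "2": 1.1e9,   # EXE_ACTIVITY.2_PORTS_UTIL
      "3m": 1.6e9,  # UOPS_EXECUTED.CYCLES_GE_3
  }
  for name, cycles in buckets.items():
      # Each node is simply that bucket's share of thread cycles.
      print(f"tma_ports_utilized_{name}: {cycles / clks:.1%}")
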
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_predecode",
+        "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_register",
+        "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_reorder_buffer",
+        "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@) - tma_core_bound",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_resource_bound",
+        "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bound >0.10))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Default;TopdownL1;tma_L1_group",
+        "MetricName": "tma_retiring",
+        "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
+        "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_serialization",
+        "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations",
+        "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks + tma_c02_wait",
+        "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
+        "MetricName": "tma_serializing_operation",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer)",
+        "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_shuffles_256b",
+        "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_slow_pause",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_split_loads",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
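
The Level-1 nodes such as tma_retiring above are plain ratios over the four topdown slot counters. A worked example with hypothetical counter values:

  fe_bound, bad_spec, retiring, be_bound = 1.0e9, 0.4e9, 3.1e9, 1.5e9
  total_slots = fe_bound + bad_spec + retiring + be_bound
  tma_retiring = retiring / total_slots
  # ScaleUnit "100%" means the tool displays the fraction as a percentage.
  print(f"tma_retiring = {tma_retiring:.1%}")  # ~51.7% here
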
+    {
+        "BriefDescription": "This metric represents rate of split store accesses",
+        "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
+        "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
+        "MetricName": "tma_split_stores",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
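
Several of these nodes charge a fixed per-event penalty; tma_store_fwd_blk above, for instance, assumes roughly 13 cycles per blocked store-forward. A sketch with assumed counts (not measured data):

  ld_blocks_store_forward = 20e6  # LD_BLOCKS.STORE_FORWARD
  thread_clks = 4.0e9             # tma_info_thread_clks
  tma_store_fwd_blk = 13 * ld_blocks_store_forward / thread_clks
  print(f"tma_store_fwd_blk = {tma_store_fwd_blk:.2%}")  # compare against the 0.1 threshold
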
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
-        "MetricName": "tma_nuke",
-        "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)",
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_other_fb",
-        "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_predecode",
-        "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_register",
-        "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_reorder_buffer",
-        "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
-        "MetricExpr": "tma_backend_bound - tma_core_bound",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
-        "MetricName": "tma_resource_bound",
-        "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound > 0.1",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that result in retirement slots",
-        "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "Default;TopdownL1;tma_L1_group",
-        "MetricName": "tma_retiring",
-        "MetricThreshold": "tma_retiring > 0.75",
-        "MetricgroupNoGroup": "TopdownL1;Default",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_serialization",
-        "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
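
The tma_store_stlb_miss_{4k,2m,1g} nodes added above split the parent walk cost in proportion to completed walks per page size. A sketch of that arithmetic with assumed counts:

  walks = {"4k": 9.0e6, "2m_4m": 0.9e6, "1g": 0.1e6}  # DTLB_STORE_MISSES.WALK_COMPLETED_*
  tma_store_stlb_miss = 0.06                          # parent node value, as a fraction
  total = sum(walks.values())
  for size, n in walks.items():
      # Each child gets the parent's cost weighted by its share of walks.
      print(f"store STLB miss, {size} pages: {tma_store_stlb_miss * n / total:.2%}")
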
@@ -715,8 +2636,8 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
-        "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
         "MetricThreshold": "tma_alu_op_utilization > 0.4",
@@ -725,17 +2646,17 @@
     },
     {
         "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
-        "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.",
-        "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thread_slots",
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_avx_assists",
         "MetricThreshold": "tma_avx_assists > 0.1",
@@ -745,7 +2666,7 @@
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
         "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
@@ -762,46 +2683,151 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
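
Mechanically, many of the cpu_core hunks in this range do one textual thing: drop the explicit "cpu_core@EVENT@" PMU qualification in favor of the bare event name. A rough regex illustrating only the modifier-free case (illustrative; this is not how the tree's jevents.py actually processes the files):

  import re

  def strip_cpu_core(expr: str) -> str:
      # cpu_core@EVENT.NAME@ -> EVENT.NAME; events carrying \,cmask\=...
      # style modifiers keep their pmu@...@ form and are not handled here.
      return re.sub(r"cpu_core@([A-Za-z0-9_.]+)@", r"\1", expr)

  print(strip_cpu_core("78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots"))
  # -> 78 * ASSISTS.ANY / tma_info_thread_slots
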
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_core"
+    },
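
All the Bottleneck metrics above follow one pattern: walk down the TMA tree, weighting each level by its share among its siblings, and scale to 0-100. A sketch of one term from tma_bottleneck_cache_memory_bandwidth, with hypothetical node values:

  tma_memory_bound = 0.30
  tma_dram_bound, siblings_sum = 0.12, 0.25  # DRAM vs. l1+l2+l3+dram+store
  tma_mem_bandwidth, tma_mem_latency = 0.08, 0.04
  term = (tma_memory_bound
          * (tma_dram_bound / siblings_sum)              # DRAM's share of memory-bound
          * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)))
  print(f"DRAM-bandwidth contribution: {100 * term:.1f}")  # reported on a 0-100 scale
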
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
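
For concreteness, tma_bottleneck_mispredictions above transcribed directly into Python with hypothetical node values (the 10x factor and the fetch-latency split come straight from the MetricExpr):

  branch_mispredicts = 0.14
  microcode_sequencer, other_mispredicts = 0.02, 0.01
  fetch_latency, mispredicts_resteers = 0.10, 0.03
  fetch_lat_children = 0.09  # icache+itlb+resteers+ms_switches+lcp+dsb_switches
  scale = 1 - 10 * microcode_sequencer * other_mispredicts / branch_mispredicts
  cost = 100 * scale * (branch_mispredicts
                        + fetch_latency * mispredicts_resteers / fetch_lat_children)
  print(f"tma_bottleneck_mispredictions = {cost:.1f}")  # flagged when > 20
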
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
-        "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
-        "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks + tma_unknown_branches",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings).",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_clks",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
         "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_c01_wait",
-        "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings).",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_clks",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
         "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_c02_wait",
-        "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
"tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -810,28 +2836,82 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 
0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "(25 * tma_info_system_core_frequency * (cpu_core@ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) + 24 * tma_info_system_core_frequ= ency * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@) * (1 + cpu_core@MEM_LOA= D_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thre= ad_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((28 * tma_info_system_core_frequency - 3 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * tma_info_system_core_frequ= ency - 3 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -842,107 +2922,107 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "24 * tma_info_system_core_frequency * (cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_F= WD@ * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_L= OAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(27 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cp= u_core@INST_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", - "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info= _thread_clks", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", - "MetricExpr": "(cpu_core@IDQ.DSB_CYCLES_ANY@ - cpu_core@IDQ.DSB_CY= CLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_core_clks / 2", "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D1@ + cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@, max(cpu_core@CYCLE_ACTIVIT= Y.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_in= fo_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEM= ORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D1@ + cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", - "MetricExpr": "cpu_core@L1D_PEND_MISS.FB_FULL@ / tma_info_thread_c= lks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -953,28 +3033,28 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -984,138 +3064,147 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists", - "MetricExpr": "30 * cpu_core@ASSISTS.FP@ / tma_info_thread_slots", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .). Sample with: UOPS= _RETIRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + 
"MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", @@ -1123,15 +3212,15 @@ }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;BadSpec;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmispredict", "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", - "MetricExpr": "cpu_core@INT_MISC.CLEARS_COUNT@ / (cpu_core@BR_MISP= _RETIRED.ALL_BRANCHES@ + cpu_core@MACHINE_CLEARS.COUNT@)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio", "Unit": "cpu_core" @@ -1146,8 +3235,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", @@ -1155,7 +3244,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1164,142 +3253,36 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + 2 = * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRED.NOP@) / tma_i= nfo_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_h= it_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1_hit_= latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_b= ig_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY\\,umask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_= l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_stor= e / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_late= ncy + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_i= nstructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@BR_= INST_RETIRED.NEAR_RETURN@) / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_callret", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are non-taken condi= tionals", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_NTAKEN@ / cpu_core@BR= _INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_nt", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are taken condition= als", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_TAKEN@ / cpu_core@BR_= INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_tk", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_TAKEN@ - cpu_core@BR= _INST_RETIRED.COND_TAKEN@ - 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@) / cpu_= core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_jump", "Unit": "cpu_core" @@ -1313,50 +3296,50 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - 
"MetricExpr": "(cpu_core@CPU_CLK_UNHALTED.DISTRIBUTED@ if #SMT_on = else tma_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_core_core_clk= s", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_core_coreipc", "Unit": "cpu_core" }, { "BriefDescription": "uops Executed per Cycle", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / tma_info_thread_cl= ks", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", "MetricGroup": "Power", "MetricName": "tma_info_core_epc", "Unit": "cpu_core" }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / tma_info_core_core_clks", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", "MetricGroup": "Flops;Ret", "MetricName": "tma_info_core_flopc", "Unit": "cpu_core" }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_frontend_dsb_coverage", "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", @@ -1364,29 +3347,29 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FRONTEND_RETI= RED.ANY_DSB_MISS@", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", "MetricGroup": "DSBmiss;Fed", "MetricName": "tma_info_frontend_ipdsb_miss_ret", "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", @@ -1394,50 +3377,57 @@ }, { "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@BACLEARS.= ANY@", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_ipunknown_branch", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", - "MetricExpr": "1e3 * cpu_core@FRONTEND_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": 
"IcMiss", "MetricName": "tma_info_frontend_l2mpki_code", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.CODE_RD_MISS@ / cpu_core@IN= ST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", "MetricGroup": "Fed;LSD", "MetricName": "tma_info_frontend_lsd_coverage", "Unit": "cpu_core" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch.", - "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", "Unit": "cpu_core" }, { "BriefDescription": "Total number of retired Instructions", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "INST_RETIRED.ANY", "MetricGroup": "Summary;TmaL1;tma_L1_group", "MetricName": "tma_info_inst_mix_instructions", "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", @@ -1445,52 +3435,52 @@ }, { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + cpu_core@FP_ARITH_INST_RETIRED.VECTOR@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_DOUBLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_SINGLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Branches;Fed;InsType", "MetricName": "tma_info_inst_mix_ipbranch", "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", @@ -1498,7 +3488,7 @@ }, { "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_ipcall", "MetricThreshold": "tma_info_inst_mix_ipcall < 200", @@ -1506,7 +3496,7 @@ }, { "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ = + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_= RETIRED.256B_PACKED_SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_ipflop", "MetricThreshold": "tma_info_inst_mix_ipflop < 10", @@ -1514,7 +3504,7 @@ }, { "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_LOADS@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipload", "MetricThreshold": "tma_info_inst_mix_ipload < 3", @@ -1522,14 +3512,14 @@ }, { "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@CPU_CLK_U= NHALTED.PAUSE_INST@", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_ippause", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_STORES@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipstore", "MetricThreshold": "tma_info_inst_mix_ipstore < 8", @@ -1537,7 +3527,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@SW_PREFETCH_A= CCESS.T0\\,umask\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", @@ -1545,10 +3535,10 @@ }, { "BriefDescription": "Instructions per taken branch", - "MetricExpr": 
"cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1582,176 +3572,184 @@ }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@= INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_fb_hpki", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.ALL_DEMAND_DATA_RD@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L2_LINES_IN.ALL@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (cpu_core@L2_RQSTS.REFERENCES@ - cpu_core@L2_= RQSTS.MISS@) / cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_HIT@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_load", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "Backend;CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki", 
"Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.MISS@ / cpu_core@INST_RETIR= ED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem;Offcore", "MetricName": "tma_info_memory_l2mpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_MISS@ / cpu_= core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.RFO_MISS@ / cpu_core@INST_R= ETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheMisses;Offcore", "MetricName": "tma_info_memory_l2mpki_rfo", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@OFFCORE_REQUESTS.ALL_REQUESTS@ / 1e9 = / duration_time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@LONGEST_LAT_CACHE.MISS@ / 1e9 / durat= ion_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L3_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_l3mpki", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD@ / cp= u_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_data_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS.DEMAND_DATA_RD@", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" }, { 
"BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAN= D_DATA_RD@ / cpu_core@OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", "MetricGroup": "Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l3_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@MEM_LOAD= _COMPLETED.L1_MISS_ANY@", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency", "Unit": "cpu_core" }, { "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "1e3 * cpu_core@SQ_MISC.BUS_LOCK@ / cpu_core@INST_RE= TIRED.ANY@", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_bus_lock_pki", "Unit": "cpu_core" }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_MISC_RETIRED.UC@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki", "Unit": "cpu_core" }, { "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@L1D_PEND= _MISS.PENDING_CYCLES@", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", "MetricGroup": "Mem;MemoryBW;MemoryBound", "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
Per-Logical Proce= ssor)", "Unit": "cpu_core" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", - "MetricExpr": "1e3 * cpu_core@ITLB_MISSES.WALK_COMPLETED@ / cpu_co= re@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricGroup": "Fed;MemoryTLB", "MetricName": "tma_info_memory_tlb_code_stlb_mpki", "Unit": "cpu_core" }, { "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "1e3 * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED@ / c= pu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_mpki", "Unit": "cpu_core" }, { "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu_core@ITLB_MISSES.WALK_PENDING@ + cpu_core@DTLB= _LOAD_MISSES.WALK_PENDING@ + cpu_core@DTLB_STORE_MISSES.WALK_PENDING@) / (4= * tma_info_core_core_clks)", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", @@ -1759,58 +3757,58 @@ }, { "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "1e3 * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED@ / = cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki", "Unit": "cpu_core" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from DSB per c= ycle", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@IDQ.DSB_CYCLES_AN= Y@", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_dsb", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from LSD per c= ycle", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@LSD.CYCLES_ACTIVE@", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_lsd", "Unit": "cpu_core" }, { "BriefDescription": 
"Average number of uops fetched from MITE per = cycle", - "MetricExpr": "cpu_core@IDQ.MITE_UOPS@ / cpu_core@IDQ.MITE_CYCLES_= ANY@", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_mite", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per a microcode Assist invocatio= n", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1818,7 +3816,7 @@ }, { "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C0_WAIT@ / tma_info_threa= d_clks", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", "MetricGroup": "C0Wait", "MetricName": "tma_info_system_c0_wait", "MetricThreshold": "tma_info_system_c0_wait > 0.05", @@ -1826,7 +3824,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency", "Unit": "cpu_core" @@ -1840,22 +3838,22 @@ }, { "BriefDescription": "Average number of utilized CPUs", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "Summary", "MetricName": "tma_info_system_cpus_utilized", "Unit": "cpu_core" }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", @@ -1863,22 +3861,23 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -1901,9 +3900,24 @@ "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)", "Unit": "cpu_core" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power", + "Unit": "cpu_core" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", - "MetricExpr": "(1 - cpu_core@CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE@ /= cpu_core@CPU_CLK_UNHALTED.REF_DISTRIBUTED@ if #SMT_on else 0)", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", "MetricGroup": "SMT", "MetricName": "tma_info_system_smt_2t_utilization", "Unit": "cpu_core" @@ -1915,16 +3929,24 @@ "MetricName": "tma_info_system_socket_clks", "Unit": "cpu_core" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_core" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "tma_info_thread_clks / cpu_core@CPU_CLK_UNHALTED.RE= F_TSC@", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization", "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", "Unit": "cpu_core" @@ -1934,40 +3956,41 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_thread_clks", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", "MetricGroup": "Ret;Summary", "MetricName": "tma_info_thread_ipc", "Unit": "cpu_core" }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization", "Unit": "cpu_core" }, { "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@INS= T_RETIRED.ANY@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", "MetricGroup": "Pipeline;Ret;Retire", "MetricName": "tma_info_thread_uoppi", "MetricThreshold": "tma_info_thread_uoppi > 1.05", @@ -1975,10 +3998,19 @@ }, { "BriefDescription": "Uops per taken branch", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { @@ -1987,114 +4019,124 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. 
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
-        "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_VEC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_128b",
-        "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
-        "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_VEC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_256b",
-        "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
-        "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
-        "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
-        "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu_core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 / 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
-        "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
-        "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "9 * tma_info_system_core_frequency * (cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(12 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. 
([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_2_3_10@ / (3 * tma_in= fo_core_core_clks)", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_load_op_utilization", "MetricThreshold": "tma_load_op_utilization > 0.6", @@ -2107,36 +4149,63 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": 
"tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "(16 * max(0, cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ = - cpu_core@L2_RQSTS.ALL_RFO@) + cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu= _core@MEM_INST_RETIRED.ALL_STORES@ * (10 * cpu_core@L2_RQSTS.RFO_HIT@ + min= (cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.C= YCLES_WITH_DEMAND_RFO@))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(cpu_core@LSD.CYCLES_ACTIVE@ - cpu_core@LSD.CYCLES_= OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2147,54 +4216,54 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clk= s", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", @@ -2203,27 +4272,27 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "cpu_core@UOPS_RETIRED.MS@ / tma_info_thread_slots", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", - "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(cpu_core@IDQ.MITE_CYCLES_ANY@ - cpu_core@IDQ.MITE_= CYCLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", @@ -2232,41 +4301,50 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", - "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. 
Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (cpu_core@BR_INST_RETIRED.AL= L_BRANCHES@ - cpu_core@INST_RETIRED.MACRO_FUSED@) / (tma_retiring * tma_inf= o_thread_slots)", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2282,89 +4360,89 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", - "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", - "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * cpu_core@ASSISTS.PAGE_FAULT@ / tma_info_thread= _slots", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_0@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_1@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_6@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_= ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOU= ND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu= _core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info_thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + max(cpu= _core@RS.EMPTY\\,umask\\=3D1@ - cpu_core@RESOURCE_STALLS.SCOREBOARD@, 0)) /= tma_info_thread_clks * (cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@E= XE_ACTIVITY.BOUND_ON_LOADS@) / tma_info_thread_clks", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2372,21 +4450,21 @@ { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2394,7 +4472,7 @@ { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2405,98 +4483,98 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", - "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu_core@L= D_BLOCKS.NO_SR@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ / tma_info_= core_core_clks", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
     {
         "BriefDescription": "This metric represents rate of split store accesses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ / tma_info_core_core_clks",
+        "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
-        "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.L2_STALLS@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
-        "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_thread_clks",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
-        "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_thread_clks",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
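For a feel of the tma_store_fwd_blk formula above: the constant 13 is an
estimated per-occurrence penalty in cycles, so the metric is just penalty
times occurrences over clocks. With made-up counter readings:

  ld_blocks_store_forward = 2_000_000   # LD_BLOCKS.STORE_FORWARD reading (invented)
  thread_clks = 1_000_000_000           # tma_info_thread_clks (invented)
  tma_store_fwd_blk = 13 * ld_blocks_store_forward / thread_clks
  print(f"{tma_store_fwd_blk:.1%}")     # 2.6%, below the 0.1 (10%) threshold
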
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
-        "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@) + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
-        "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_4_9@ + cpu_core@UOPS_DISPATCHED.PORT_7_8@) / (4 * tma_info_core_core_clks)",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_store_op_utilization",
         "MetricThreshold": "tma_store_op_utilization > 0.6",
@@ -2509,46 +4587,73 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
-        "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_core_core_clks",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
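The three new TopdownL6 metrics split tma_store_stlb_miss in proportion to
completed page walks per page size, so they sum back to the parent metric.
A small sanity sketch with invented counts:

  walks = {"4K": 90_000, "2M_4M": 9_000, "1G": 1_000}  # DTLB_STORE_MISSES.WALK_COMPLETED_* (invented)
  tma_store_stlb_miss = 0.08
  total = sum(walks.values())
  parts = {k: tma_store_stlb_miss * v / total for k, v in walks.items()}
  assert abs(sum(parts.values()) - tma_store_stlb_miss) < 1e-12
  print(parts)  # roughly {'4K': 0.072, '2M_4M': 0.0072, '1G': 0.0008}
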
     {
         "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
-        "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_info_thread_clks",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-        "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
-        "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
-        "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_core@UOPS_EXECUTED.THREAD@",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     }
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/cache.json b/tools/perf/pmu-events/arch/x86/alderlake/cache.json
index 3f51686fe7a8..a20e19738046 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/cache.json
@@ -91,6 +91,26 @@
         "UMask": "0x1f",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Modified cache lines that are evicted by L2 cache when triggered by an L2 cache fill.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.NON_SILENT",
+        "PublicDescription": "Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.SILENT",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
         "Counter": "0,1,2,3",
@@ -101,6 +121,15 @@
         "UMask": "0x4",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the total number of L2 Cache accesses. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.ALL",
+        "PublicDescription": "Counts the total number of L2 Cache Accesses, includes hits, misses, rejects front door requests for CRd/DRd/RFO/ItoM/L2 Prefetches only. Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "All accesses to L2 cache [This event is alias to L2_RQSTS.REFERENCES]",
         "Counter": "0,1,2,3",
@@ -111,6 +140,26 @@
         "UMask": "0xff",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of L2 Cache accesses that resulted in a hit. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.HIT",
+        "PublicDescription": "Counts the number of L2 Cache accesses that resulted in a hit from a front door request only (does not include rejects or recycles), Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of L2 Cache accesses that resulted in a miss. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.MISS",
+        "PublicDescription": "Counts the number of L2 Cache accesses that resulted in a miss from a front door request only (does not include rejects or recycles). Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
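With the new atom-side L2_REQUEST events, a front-door L2 hit rate falls
out directly; note HIT excludes rejects and recycles, so HIT plus MISS need
not equal ALL. Invented counts:

  l2_all = 1_000_000    # L2_REQUEST.ALL (invented)
  l2_miss = 150_000     # L2_REQUEST.MISS (invented)
  print(f"L2 hit rate: {1 - l2_miss / l2_all:.1%}")  # 85.0%
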
     {
         "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_RQSTS.MISS]",
         "Counter": "0,1,2,3",
@@ -412,7 +461,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "1000003",
         "UMask": "0x81",
@@ -424,7 +472,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "1000003",
         "UMask": "0x82",
@@ -436,7 +483,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired memory instructions - loads and stores.",
         "SampleAfterValue": "1000003",
         "UMask": "0x83",
@@ -448,7 +494,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with locked access.",
         "SampleAfterValue": "100007",
         "UMask": "0x21",
@@ -460,7 +505,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x41",
@@ -472,7 +516,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x42",
@@ -484,7 +527,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x11",
@@ -496,7 +538,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
-        "PEBS": "1",
         "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x12",
@@ -518,7 +559,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4",
@@ -530,7 +570,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2",
@@ -542,7 +581,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4",
@@ -554,7 +592,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x1",
@@ -566,7 +603,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
         "SampleAfterValue": "100003",
         "UMask": "0x8",
@@ -578,7 +614,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2",
@@ -590,7 +625,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
-        "PEBS": "1",
         "PublicDescription": "Retired load instructions which data sources missed L3 but serviced from local DRAM.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -602,7 +636,6 @@
         "Data_LA": "1",
         "EventCode": "0xd4",
         "EventName": "MEM_LOAD_MISC_RETIRED.UC",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).",
         "SampleAfterValue": "100007",
         "UMask": "0x4",
@@ -614,7 +647,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.FB_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
         "SampleAfterValue": "100007",
         "UMask": "0x40",
@@ -626,7 +658,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
         "SampleAfterValue": "1000003",
         "UMask": "0x1",
@@ -638,7 +669,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
         "SampleAfterValue": "200003",
         "UMask": "0x8",
@@ -650,7 +680,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
         "SampleAfterValue": "200003",
         "UMask": "0x2",
@@ -662,7 +691,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
         "SampleAfterValue": "100021",
         "UMask": "0x10",
@@ -674,7 +702,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
         "SampleAfterValue": "100021",
         "UMask": "0x4",
@@ -686,7 +713,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
         "SampleAfterValue": "50021",
         "UMask": "0x20",
@@ -698,33 +724,90 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.DRAM_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x80",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache, in which a snoop was required and modified data was forwarded from another core or module.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.HITM",
+        "SampleAfterValue": "200003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L1 data cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L1 data cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of load uops retired that hit in the L2 cache.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x2",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L2 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x4",
         "Unit": "cpu_atom"
     },
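Together with the pre-existing DRAM_HIT, the MEM_LOAD_UOPS_RETIRED.*
additions above give a per-level "where did retired loads hit" breakdown
for cpu_atom; roughly (invented counts):

  loads = {"L1_HIT": 900_000, "L2_HIT": 60_000,   # MEM_LOAD_UOPS_RETIRED.* (invented)
           "L3_HIT": 30_000, "DRAM_HIT": 10_000}
  total = sum(loads.values())
  for level, n in loads.items():
      print(f"{level:8s} {n / total:6.1%}")
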
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache, in which a snoop was required, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd2",
+        "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.HIT_E_F",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x40",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd2",
+        "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.L3_MISS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of cycles that uops are blocked for any of the following reasons: load buffer, store buffer or RSV full.",
         "Counter": "0,1,2,3,4,5",
@@ -776,7 +859,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of load uops retired.",
         "SampleAfterValue": "200003",
         "UMask": "0x81",
@@ -788,7 +870,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of store uops retired.",
         "SampleAfterValue": "200003",
         "UMask": "0x82",
@@ -802,7 +883,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 128 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -816,7 +896,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 16 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -830,7 +909,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 256 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -844,7 +922,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 32 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -858,7 +935,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 4 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -872,7 +948,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 512 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -886,7 +961,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 64 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -900,7 +974,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 8 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -912,7 +985,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x21",
         "Unit": "cpu_atom"
@@ -923,18 +995,46 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x41",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the total number of load and store uops retired that missed in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x13",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load ops retired that miss in the second Level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x11",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of store ops retired that miss in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES",
+        "SampleAfterValue": "200003",
+        "UMask": "0x12",
+        "Unit": "cpu_atom"
+    },
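Since each LOAD_LATENCY_GT_N event counts loads at or above its threshold,
the thresholds are cumulative and a coarse latency histogram can be built
by differencing adjacent counts, e.g. (invented numbers):

  gt = {4: 100_000, 8: 40_000, 16: 15_000, 32: 6_000,
        64: 2_500, 128: 1_000, 256: 300, 512: 80}   # MEM_UOPS_RETIRED.LOAD_LATENCY_GT_* (invented)
  ths = sorted(gt)
  for lo, hi in zip(ths, ths[1:]):
      print(f"[{lo:3d}, {hi:3d}) cycles: {gt[lo] - gt[hi]}")
  print(f">= 512 cycles: {gt[512]}")
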
     {
         "BriefDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled. If PEBS is enabled and a PEBS record is generated, will populate PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x6",
@@ -950,13 +1050,57 @@
         "UMask": "0x3",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that were supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0001",
+        "MSRValue": "0x1F803C0001",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -1022,7 +1166,7 @@
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_RFO.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0002",
+        "MSRValue": "0x1F803C0002",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -1049,6 +1193,72 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
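The OCR MSRValue masks are programmed into the offcore response MSR pair
(0x1a6/0x1a7); without relying on the exact bit semantics, one can check
mechanically that each snoop-specific variant above is a bit subset of its
L3_HIT umbrella mask:

  l3_hit = 0x1F803C0004                      # OCR.DEMAND_CODE_RD.L3_HIT
  variants = {
      "SNOOP_HITM":         0x10003C0004,
      "SNOOP_HIT_NO_FWD":   0x04003C0004,
      "SNOOP_HIT_WITH_FWD": 0x08003C0004,
  }
  for name, mask in variants.items():
      assert mask & ~l3_hit == 0, name       # every variant bit lies within L3_HIT
  print("all snoop variants are subsets of the L3_HIT mask")
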
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "OFFCORE_REQUESTS.ALL_REQUESTS",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
index b4621c221f58..62fd70f220e5 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
@@ -1,4 +1,13 @@
 [
+    {
+        "BriefDescription": "Counts the number of cycles the floating point divider is in the loop stage.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "ARITH.FPDIV_ACTIVE",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -9,6 +18,15 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of floating point divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts all microcode FP assists.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -187,7 +205,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.FPDIV",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x8",
         "Unit": "cpu_atom"
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
index 66735a612ebd..c5b3818ad479 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
@@ -55,7 +55,6 @@
         "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x1",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -68,7 +67,6 @@
         "EventName": "FRONTEND_RETIRED.DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x11",
-        "PEBS": "1",
         "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -81,7 +79,6 @@
         "EventName": "FRONTEND_RETIRED.ITLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x14",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -94,7 +91,6 @@
         "EventName": "FRONTEND_RETIRED.L1I_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x12",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -107,7 +103,6 @@
         "EventName": "FRONTEND_RETIRED.L2_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x13",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -120,7 +115,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600106",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -133,7 +127,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x608006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -146,7 +139,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x601006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -159,7 +151,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600206",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -172,7 +163,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x610006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -185,7 +175,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x100206",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -198,7 +187,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x602006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -211,7 +199,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600406",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -224,7 +211,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x620006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -237,7 +223,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x604006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -250,7 +235,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600806",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -263,7 +247,6 @@
         "EventName": "FRONTEND_RETIRED.MS_FLOWS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x8",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
         "Unit": "cpu_core"
@@ -275,7 +258,6 @@
         "EventName": "FRONTEND_RETIRED.STLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x15",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -288,7 +270,6 @@
         "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x17",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
         "Unit": "cpu_core"
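The FRONTEND_RETIRED.LATENCY_GE_N MSR 0x3F7 values above appear to share a
common base of 0x600006 with the cycle threshold shifted into the field at
bit 8; this little check re-derives the table entries (a reading of the
data, not a documented encoding):

  BASE = 0x600006
  for n in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
      print(f"LATENCY_GE_{n}: {hex(BASE | (n << 8))}")
  # e.g. LATENCY_GE_128 -> 0x608006 and LATENCY_GE_512 -> 0x620006, matching above
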
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/memory.json b/tools/perf/pmu-events/arch/x86/alderlake/memory.json
index 81a03f53aadc..fa15f5797bed 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/memory.json
@@ -133,7 +133,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x400",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "53",
         "UMask": "0x1",
@@ -147,7 +146,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1",
@@ -161,7 +159,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1",
@@ -175,7 +172,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1",
@@ -189,7 +185,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -203,7 +198,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
@@ -217,7 +211,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1",
@@ -231,7 +224,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1",
@@ -245,7 +237,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1",
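Note how SampleAfterValue shrinks as the latency threshold grows (50021 for
GT_8 down to 53 for GT_1024): rarer events get smaller sample periods so
sampling stays usable. A back-of-the-envelope view with invented event
rates:

  rate = {16: 2_000_000, 128: 20_000, 1024: 500}  # events/sec (invented)
  period = {16: 20011, 128: 1009, 1024: 53}       # SampleAfterValue from this file
  for thr in (16, 128, 1024):
      print(f"GT_{thr}: ~{rate[thr] / period[thr]:.0f} samples/sec")
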
@@ -257,12 +248,22 @@
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE",
-        "PEBS": "2",
         "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8",
         "SampleAfterValue": "1000003",
         "UMask": "0x2",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84400004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
@@ -329,6 +330,17 @@
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84404000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data read requests that miss the L3 cache.",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
index b54a5fc0861f..855585fe6fae 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
@@ -41,6 +41,7 @@
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -89,7 +90,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -99,6 +102,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_bandwidth category",
     "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latency category",
     "tma_int_operations_group": "Metrics contributing to tma_int_operations category",
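metricgroups.json only names and describes the groups; membership comes
from each metric's semicolon-separated MetricGroup field. A sketch
(illustrative path and loading code) of listing the metrics in the new
LockCont group:

  import json

  with open("adl-metrics.json") as f:      # path is illustrative
      metrics = json.load(f)
  lockcont = [m["MetricName"] for m in metrics
              if "LockCont" in m.get("MetricGroup", "").split(";")]
  print(lockcont)                          # would include tma_store_latency after this update
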
$issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -138,5 +145,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/alderlake/other.json b/tools/pe= rf/pmu-events/arch/x86/alderlake/other.json index f95e093f8fcf..a8b23e92408c 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/other.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/other.json @@ -1,9 +1,10 @@ [ { - "BriefDescription": "ASSISTS.HARDWARE", + "BriefDescription": "Count all other hardware assists or traps tha= t are not necessarily architecturally exposed (through a software handler) = beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-eve= nts. the event also counts for Machine Ordering count.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc1", "EventName": "ASSISTS.HARDWARE", + "PublicDescription": "Count all other hardware assists or traps th= at are not necessarily architecturally exposed (through a software handler)= beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-ev= ents. This includes, but not limited to, assists at EXE or MEM uop writeba= ck like AVX* load/store/gather/scatter (non-FP GSSE-assist ) , assists gene= rated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) a= nd CET (Control flow Enforcement Technology) assists. 
the event also counts= for Machine Ordering count.", "SampleAfterValue": "100003", "UMask": "0x4", "Unit": "cpu_core" @@ -50,7 +51,6 @@ "Deprecated": "1", "EventCode": "0xe4", "EventName": "LBR_INSERTS.ANY", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_atom" @@ -66,6 +66,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand data reads that have any type o= f response.", "Counter": "0,1,2,3,4,5", @@ -88,6 +110,17 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", "Counter": "0,1,2,3", @@ -121,6 +154,39 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores which modify only par= t of a 64 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3,4,5", @@ -143,6 +209,28 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that have any type of respons= e.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x14000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", + "Counter": 
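Pairing a DRAM event with its ANY_RESPONSE counterpart gives the share of
requests that were served from memory, for example (invented counts):

  any_response = 5_000_000   # OCR.DEMAND_DATA_RD.ANY_RESPONSE (invented)
  from_dram = 400_000        # OCR.DEMAND_DATA_RD.DRAM (invented)
  print(f"demand reads served by DRAM: {from_dram / any_response:.1%}")  # 8.0%
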
"0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784004000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json b/tools= /perf/pmu-events/arch/x86/alderlake/pipeline.json index b7656f77dee9..f5bf0816f190 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json @@ -10,6 +10,16 @@ "UMask": "0x9", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, { "BriefDescription": "Cycles when divide unit is busy executing div= ide or square root operations.", "Counter": "0,1,2,3,4,5,6,7", @@ -21,6 +31,24 @@ "UMask": "0x9", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of active floating point an= d integer dividers per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point and integ= er divider uops executed per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_atom" + }, { "BriefDescription": "This event is deprecated. Refer to new event = ARITH.FPDIV_ACTIVE", "Counter": "0,1,2,3,4,5,6,7", @@ -32,6 +60,16 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of cycles any of the two in= teger dividers are active.", + "Counter": "0,1,2,3,4,5", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "This event counts the cycles the integer divi= der is busy.", "Counter": "0,1,2,3,4,5,6,7", @@ -42,6 +80,24 @@ "UMask": "0x8", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of active integer dividers = per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of integer divider uops exe= cuted per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "This event is deprecated. Refer to new event = ARITH.IDIV_ACTIVE", "Counter": "0,1,2,3,4,5,6,7", @@ -68,7 +124,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. 
All bran= ch type instructions are accounted for.", "SampleAfterValue": "200003", "Unit": "cpu_atom" @@ -78,7 +133,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009", "Unit": "cpu_core" @@ -89,7 +143,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf9", "Unit": "cpu_atom" @@ -99,7 +152,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -109,7 +161,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11", @@ -120,7 +171,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10", @@ -131,7 +181,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -141,7 +190,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1", @@ -152,7 +200,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xbf", "Unit": "cpu_atom" @@ -162,7 +209,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40", @@ -173,7 +219,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -183,7 +228,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. 
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80", @@ -194,7 +238,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -205,7 +248,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -216,7 +258,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -226,7 +267,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf9", "Unit": "cpu_atom" @@ -236,7 +276,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2", @@ -247,7 +286,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -257,7 +295,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8", @@ -268,7 +305,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xc0", "Unit": "cpu_atom" @@ -278,7 +314,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20", @@ -290,7 +325,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -300,7 +334,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.REL_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfd", "Unit": "cpu_atom" @@ -311,7 +344,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -322,7 +354,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -332,7 +363,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. 
A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", "SampleAfterValue": "200003", "Unit": "cpu_atom" @@ -342,7 +372,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "400009", "Unit": "cpu_core" @@ -352,7 +381,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -362,7 +390,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "400009", "UMask": "0x11", @@ -373,7 +400,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "400009", "UMask": "0x10", @@ -384,7 +410,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -394,7 +419,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1", @@ -405,7 +429,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -415,7 +438,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. 
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80", @@ -426,7 +448,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -436,7 +457,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2", @@ -448,7 +468,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -459,7 +478,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -469,7 +487,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x80", "Unit": "cpu_atom" @@ -479,7 +496,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20", @@ -491,7 +507,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -501,7 +516,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "100007", "UMask": "0x8", @@ -512,7 +526,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -523,7 +536,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -616,6 +628,16 @@ "UMask": "0x40", "Unit": "cpu_core" }, + { + "BriefDescription": "This event is deprecated. Refer to new event = CPU_CLK_UNHALTED.REF_TSC_P", + "Counter": "0,1,2,3,4,5", + "Deprecated": "1", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Core crystal clock cycles. Cycle counts are e= venly distributed between active threads in the Core.", "Counter": "0,1,2,3,4,5,6,7", @@ -854,7 +876,6 @@ "BriefDescription": "Counts the total number of instructions retir= ed. (Fixed event)", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses fixed counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -864,7 +885,6 @@ "BriefDescription": "Number of instructions retired. 
Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -875,7 +895,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses a programmable general purpose performance counter.", "SampleAfterValue": "2000003", "Unit": "cpu_atom" @@ -885,7 +904,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "Unit": "cpu_core" @@ -895,7 +913,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10", "Unit": "cpu_core" @@ -905,7 +922,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instruc= tions", "SampleAfterValue": "2000003", "UMask": "0x2", @@ -915,7 +931,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -926,7 +941,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. 
Note the number of ite= rations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8", @@ -1065,7 +1079,6 @@ "Deprecated": "1", "EventCode": "0x03", "EventName": "LD_BLOCKS.4K_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4", "Unit": "cpu_atom" @@ -1075,7 +1088,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.ADDRESS_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4", "Unit": "cpu_atom" @@ -1095,7 +1107,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.DATA_UNKNOWN", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_atom" @@ -1244,7 +1255,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xe4", "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", "PublicDescription": "Counts the number of LBR entries recorded. R= equires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and N= PEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]", "SampleAfterValue": "1000003", "UMask": "0x1", @@ -1551,7 +1561,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "TOPDOWN_RETIRING.ALL", - "PEBS": "1", "SampleAfterValue": "1000003", "Unit": "cpu_atom" }, @@ -1799,7 +1808,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.ALL", - "PEBS": "1", "SampleAfterValue": "2000003", "Unit": "cpu_atom" }, @@ -1829,7 +1837,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.IDIV", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10", "Unit": "cpu_atom" @@ -1839,7 +1846,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.MS", - "PEBS": "1", "PublicDescription": "Counts the number of uops that are from comp= lex flows issued by the Microcode Sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -1895,7 +1901,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.X87", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_atom" diff --git a/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json index e0d8f3070778..132ce48af6d9 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json @@ -258,5 +258,38 @@ "SampleAfterValue": "1000003", "UMask": "0x90", "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = MEM_UOPS_RETIRED.STLB_MISS_STORES", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index d503aa7e3594..d9538723927b 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,5 +1,5 @@ Family-model,Version,Filename,EventType -GenuineIntel-6-(97|9A|B7|BA|BF),v1.27,alderlake,core +GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core GenuineIntel-6-BE,v1.27,alderlaken,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v29,broadwell,core --=20 2.47.1.545.g3c1d2e2a6a-goog
From nobody Fri Dec 27 09:48:58 2024
Date: Mon, 9 Dec 2024 14:27:39 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-3-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 02/22] perf vendor events: Update AlderlakeN events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"
Update events from v1.27 to v1.28. Update TMA metrics from 4.8 to 5.01. 
Bring in the event updates v1.28: https://github.com/intel/perfmon/commit/801f43f22ec6bd23fbb5d18860f395d61e7= f4081 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/alderlaken/adln-metrics.json | 83 ++++--- .../pmu-events/arch/x86/alderlaken/cache.json | 227 ++++++++++++++++-- .../arch/x86/alderlaken/floating-point.json | 17 +- .../arch/x86/alderlaken/memory.json | 20 ++ .../pmu-events/arch/x86/alderlaken/other.json | 81 ++++++- .../arch/x86/alderlaken/pipeline.json | 97 +++++--- .../arch/x86/alderlaken/virtual-memory.json | 30 +++ tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 8 files changed, 458 insertions(+), 99 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json b/= tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json index 447596f924ab..141562def5c4 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json @@ -48,13 +48,6 @@ "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, - { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", - "ScaleUnit": "100%" - }, { "BriefDescription": "C8 residency percent per package", "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", @@ -62,13 +55,6 @@ "MetricName": "C8_Pkg_Residency", "ScaleUnit": "100%" }, - { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", - "ScaleUnit": "100%" - }, { "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", @@ -89,7 +75,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -98,7 +84,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", + "MetricThreshold": "(tma_backend_bound >0.10)", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. 
If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -109,7 +95,7 @@ "MetricExpr": "(5 * CPU_CLK_UNHALTED.CORE - (TOPDOWN_FE_BOUND.ALL = + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricThreshold": "(tma_bad_speculation >0.15)", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", "ScaleUnit": "100%" @@ -119,7 +105,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%" }, @@ -128,7 +114,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", + "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -137,7 +123,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -145,7 +131,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -153,7 +139,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", + "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -162,7 +148,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -170,7 +156,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (5 * CPU_CLK_UNH= ALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", "ScaleUnit": "100%" }, { @@ -179,7 +165,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", + "MetricThreshold": "(tma_frontend_bound >0.20)", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -188,7 +174,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -196,7 +182,7 @@ "MetricExpr": 
"TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -205,7 +191,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (5 * CPU_CLK_UN= HALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -320,6 +306,11 @@ "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_ST= ALLS.IFETCH", "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit" }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (MEM_BOUND_STALLS.IFETCH_LLC_HIT + MEM_BOUND_= STALLS.IFETCH_DRAM_HIT) / MEM_BOUND_STALLS.IFETCH", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss" + }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_S= TALLS.IFETCH", @@ -336,6 +327,12 @@ "MetricGroup": "load_store_bound", "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * (MEM_BOUND_STALLS.LOAD_LLC_HIT + MEM_BOUND_ST= ALLS.LOAD_DRAM_HIT) / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s" + }, { "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", "MetricExpr": "100 * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STA= LLS.LOAD", @@ -472,6 +469,12 @@ "MetricGroup": "Summary", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", @@ -503,7 +506,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -511,7 +514,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (5 * CPU_C= LK_UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", + "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", 
"MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -520,7 +523,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -528,7 +531,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -536,7 +539,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (5 * CPU_CLK_UNHALTE= D.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", "ScaleUnit": "100%" }, { @@ -544,7 +547,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (5 * CPU_CLK_UNHALTED.CORE= )", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -552,7 +555,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (5 * CPU_CLK_UNHALTED.= CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -560,7 +563,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (5 * CPU_CLK_UNHALTED.C= ORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -568,7 +571,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -576,7 +579,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -586,7 +589,7 @@ "MetricExpr": 
"TOPDOWN_RETIRING.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", + "MetricThreshold": "(tma_retiring >0.75)", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -595,7 +598,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json b/tools/p= erf/pmu-events/arch/x86/alderlaken/cache.json index 1500033ee19f..fd9ed58c2f90 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/cache.json @@ -1,4 +1,30 @@ [ + { + "BriefDescription": "Counts the total number of L2 Cache accesses.= Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.ALL", + "PublicDescription": "Counts the total number of L2 Cache Accesses= , includes hits, misses, rejects front door requests for CRd/DRd/RFO/ItoM/= L2 Prefetches only. Counts on a per core basis.", + "SampleAfterValue": "200003" + }, + { + "BriefDescription": "Counts the number of L2 Cache accesses that r= esulted in a hit. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "PublicDescription": "Counts the number of L2 Cache accesses that = resulted in a hit from a front door request only (does not include rejects = or recycles), Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of L2 Cache accesses that r= esulted in a miss. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "PublicDescription": "Counts the number of L2 Cache accesses that = resulted in a miss from a front door request only (does not include rejects= or recycles). Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, { "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. 
Counts on a per core basis.", "Counter": "0,1,2,3,4,5", @@ -92,30 +118,81 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.DRAM_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x80" }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache, in which a snoop was required and modified data was for= warded from another core or module.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.HITM", + "SampleAfterValue": "200003", + "UMask": "0x20" + }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L1 data cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L1 data cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "200003", + "UMask": "0x8" + }, { "BriefDescription": "Counts the number of load uops retired that h= it in the L2 cache.", "Counter": "0,1,2,3,4,5", "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x2" }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L2 cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "200003", + "UMask": "0x10" + }, { "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache.", "Counter": "0,1,2,3,4,5", "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x4" }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache, in which a snoop was required, and non-modified data wa= s forwarded.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.HIT_E_F", + "SampleAfterValue": "1000003", + "UMask": "0x40" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L3 cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.L3_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x20" + }, { "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", "Counter": "0,1,2,3,4,5", @@ -154,7 +231,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts the total number of load uops retired= .", "SampleAfterValue": "200003", "UMask": "0x81" @@ -165,7 +241,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts the total number of store uops retire= d.", "SampleAfterValue": "200003", "UMask": "0x82" @@ -178,7 +253,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 128 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). 
Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -191,7 +265,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 16 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -204,7 +277,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 256 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -217,7 +289,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 32 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -230,7 +301,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 4 cycles as defin= ed in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. I= f a PEBS record is generated, will populate the PEBS Latency and PEBS Data = Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -243,7 +313,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 512 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -256,7 +325,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 64 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -269,7 +337,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 8 cycles as defin= ed in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. 
I= f a PEBS record is generated, will populate the PEBS Latency and PEBS Data = Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -280,7 +347,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x21" }, @@ -290,28 +356,93 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x41" }, + { + "BriefDescription": "Counts the total number of load and store uop= s retired that missed in the second level TLB.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the second Level TLB.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11" + }, + { + "BriefDescription": "Counts the number of store ops retired that m= iss in the second level TLB.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12" + }, { "BriefDescription": "Counts the number of stores uops retired. Cou= nts with or without PEBS enabled.", "Counter": "0,1,2,3,4,5", "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", - "PEBS": "2", "PublicDescription": "Counts the number of stores uops retired. Co= unts with or without PEBS enabled. If PEBS is enabled and a PEBS record is = generated, will populate PEBS Latency and PEBS Data Source fields according= ly.", "SampleAfterValue": "1000003", "UMask": "0x6" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1F803C0004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, and modified data was forwarded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, but no data was forwarded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, and non-modified data was forwarded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x8003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were 
supplied b= y the L3 cache.", "Counter": "0,1,2,3,4,5", "EventCode": "0xB7", "EventName": "OCR.DEMAND_DATA_RD.L3_HIT", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F803C0001", + "MSRValue": "0x1F803C0001", "SampleAfterValue": "100003", "UMask": "0x1" }, @@ -351,7 +482,7 @@ "EventCode": "0xB7", "EventName": "OCR.DEMAND_RFO.L3_HIT", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F803C0002", + "MSRValue": "0x1F803C0002", "SampleAfterValue": "100003", "UMask": "0x1" }, @@ -365,6 +496,66 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, but no data was forwa= rded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, and non-modified data= was forwarded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x8003C0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 = cache.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.L3_HIT", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1F803C4000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 = cache where a snoop was sent, the snoop hit, and modified data was forwarde= d.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10003C4000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 = cache where a snoop was sent, the snoop hit, but no data was forwarded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C4000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 = cache where a snoop was sent, the snoop hit, and non-modified data was forw= arded.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x8003C4000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to instruction cache misses.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json = b/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json index 484d8b3167f0..ed963fcb6485 100644 --- 
a/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json @@ -1,4 +1,20 @@ [ + { + "BriefDescription": "Counts the number of cycles the floating poin= t divider is in the loop stage.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of floating point divider u= ops executed per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0x8" + }, { "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", "Counter": "0,1,2,3,4,5", @@ -13,7 +29,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.FPDIV", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x8" } diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json b/tools/= perf/pmu-events/arch/x86/alderlaken/memory.json index 619488d42a4a..3b46b048dfb2 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/memory.json @@ -56,6 +56,16 @@ "SampleAfterValue": "20003", "UMask": "0x2" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F84400004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -95,5 +105,15 @@ "MSRValue": "0x3F84400002", "SampleAfterValue": "100003", "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the= L3 cache.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F84404000", + "SampleAfterValue": "100003", + "UMask": "0x1" } ] diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/other.json b/tools/p= erf/pmu-events/arch/x86/alderlaken/other.json index 54ddbe2b3b9b..f8c21b7f8f40 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/other.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/other.json @@ -5,7 +5,6 @@ "Deprecated": "1", "EventCode": "0xe4", "EventName": "LBR_INSERTS.ANY", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -19,6 +18,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that have any type o= f response.", "Counter": "0,1,2,3,4,5", @@ -29,6 +48,16 @@ "SampleAfterValue": 
"100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", "Counter": "0,1,2,3,4,5", @@ -39,6 +68,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts streaming stores which modify only par= t of a 64 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3,4,5", @@ -49,6 +108,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that have any type of respons= e.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x14000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784004000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state. 
For Tremont, UMWAIT and TPAUSE will onl= y put the CPU into C0.1 activity state (not C0.2 activity state)", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json b/tool= s/perf/pmu-events/arch/x86/alderlaken/pipeline.json index f05db45578ff..713ebc21cec0 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json @@ -1,10 +1,59 @@ [ + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, + { + "BriefDescription": "Counts the number of active floating point an= d integer dividers per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, + { + "BriefDescription": "Counts the number of floating point and integ= er divider uops executed per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0xc" + }, + { + "BriefDescription": "Counts the number of cycles any of the two in= teger dividers are active.", + "Counter": "0,1,2,3,4,5", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of active integer dividers = per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of integer divider uops exe= cuted per cycle.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. 
All bran= ch type instructions are accounted for.", "SampleAfterValue": "200003" }, @@ -14,7 +63,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf9" }, @@ -23,7 +71,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e" }, @@ -32,7 +79,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe" }, @@ -41,7 +87,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xbf" }, @@ -50,7 +95,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb" }, @@ -59,7 +103,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb" }, @@ -69,7 +112,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb" }, @@ -79,7 +121,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e" }, @@ -88,7 +129,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf9" }, @@ -97,7 +137,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7" }, @@ -106,7 +145,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xc0" }, @@ -116,7 +154,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb" }, @@ -125,7 +162,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.REL_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfd" }, @@ -135,7 +171,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7" }, @@ -145,7 +180,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe" }, @@ -154,7 +188,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. 
A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", "SampleAfterValue": "200003" }, @@ -163,7 +196,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e" }, @@ -172,7 +204,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe" }, @@ -181,7 +212,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb" }, @@ -190,7 +220,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb" }, @@ -200,7 +229,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb" }, @@ -210,7 +238,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e" }, @@ -219,7 +246,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x80" }, @@ -229,7 +255,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb" }, @@ -238,7 +263,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7" }, @@ -248,7 +272,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe" }, @@ -268,6 +291,15 @@ "PublicDescription": "Counts the number of core cycles while the c= ore is not in a halt state. The core enters the halt state when it is runni= ng the HLT instruction. The core frequency may change from time to time. Fo= r this reason this event may have a changing ratio with regards to time. Th= is event uses a programmable general purpose performance counter.", "SampleAfterValue": "2000003" }, + { + "BriefDescription": "This event is deprecated. Refer to new event = CPU_CLK_UNHALTED.REF_TSC_P", + "Counter": "0,1,2,3,4,5", + "Deprecated": "1", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF", + "SampleAfterValue": "2000003", + "UMask": "0x1" + }, { "BriefDescription": "Counts the number of unhalted reference clock= cycles at TSC frequency. (Fixed event)", "Counter": "Fixed counter 2", @@ -305,7 +337,6 @@ "BriefDescription": "Counts the total number of instructions retir= ed. (Fixed event)", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses fixed counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -315,7 +346,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. 
For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses a programmable general purpose performance counter.", "SampleAfterValue": "2000003" }, @@ -325,7 +355,6 @@ "Deprecated": "1", "EventCode": "0x03", "EventName": "LD_BLOCKS.4K_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -334,7 +363,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.ADDRESS_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -343,7 +371,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.DATA_UNKNOWN", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -392,7 +419,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xe4", "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", "PublicDescription": "Counts the number of LBR entries recorded. R= equires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and N= PEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -588,7 +614,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "TOPDOWN_RETIRING.ALL", - "PEBS": "1", "SampleAfterValue": "1000003" }, { @@ -604,7 +629,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.ALL", - "PEBS": "1", "SampleAfterValue": "2000003" }, { @@ -612,7 +636,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.IDIV", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, @@ -621,7 +644,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.MS", - "PEBS": "1", "PublicDescription": "Counts the number of uops that are from comp= lex flows issued by the Microcode Sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -631,7 +653,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.X87", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2" } diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json = b/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json index ad2b1349bab4..d9c737a17df0 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json @@ -49,5 +49,35 @@ "EventName": "LD_HEAD.DTLB_MISS_AT_RET", "SampleAfterValue": "1000003", "UMask": "0x90" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = MEM_UOPS_RETIRED.STLB_MISS_STORES", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index d9538723927b..2fa5529ee585 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,6 +1,6 @@ Family-model,Version,Filename,EventType GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core -GenuineIntel-6-BE,v1.27,alderlaken,core +GenuineIntel-6-BE,v1.28,alderlaken,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v29,broadwell,core GenuineIntel-6-56,v11,broadwellde,core --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:58 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 48802199239 for ; Mon, 9 Dec 2024 22:28:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783349; cv=none; b=d0+OC0prlq0xKBG4MhaDpCPJFaXwQTmzxjWtIRUqSE238Vdj2SJqNIsQZTBpOVm9XA01qNsUcN2d8grzZ6exG6P4aDAI6PSWdQlQN7mCTBrJxjVoix5swZZnIHWKCDuL+9rwPtnOPQc99cc7gCZdU5sEKoTTixTMa5/u4fkZEOY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783349; c=relaxed/simple; bh=6wVhYSh75gdXsfTP2yQVC4y9niiQya5uip8leCUOTcE=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=fsRPUFsuwS+I0y6CA4kagFEFvm2TcYDJFmEv0JvamG4nOP9O28HlRGGdSjyohEUtocrA4+X8PuqCmocPzd7vn0WzxTuXN0MqcfOb1kecJKnf65tSdkM21ENrwvE9srgkuGG7ImxG0qUivr5Oe9JVLIGUbSjU3FvprWjY3XJCPNc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=tmFDDkhr; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="tmFDDkhr" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6eeeebc295dso58227617b3.1 for ; Mon, 09 Dec 2024 14:28:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1733783296; x=1734388096; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=K/zzAIh4Oi5CnEF8aIBMOMg+M7pL5zuWDUAa+N+Ci9k=; b=tmFDDkhrRG/6O0UftIDYr5MegXy+ml5iVmf1OmDuTjiKdj9U/eJEU2xjCTYXIIXagx /wq6PnHUcvVf7VqLwSzr5pt+0GNCSNQ0v8JNehX7JW+iZxJkkeHI49T1xmXYQyOL0lyE NPGU56cPt6+0jxEwIF4hDgy/6DU0s29w5tA4YGQR+kZQrTIcQv83LxCDhkTzmfzEW499 OsCuyCyoNwITVO0EZEcUKRh6YLutT3wo727klXQi0HGvIpdIdg/kzGkDjExgOHaX0gtl ZN8ZRPPI7AdlL3xs+EEw5Bc/mFs6Ii+UO7xrwErinZUc/Ix8eDyKI4icd5Cs22h5MSfq oIWA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1733783296; x=1734388096; 
h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=K/zzAIh4Oi5CnEF8aIBMOMg+M7pL5zuWDUAa+N+Ci9k=; b=LMXalpxHmuWvtNCpqpw5rv4mg0BR5tmn0AMGXQgr2efSoLIWzmrGfe+gXKza6bdThw HLS9A2JW9IYsIHLkuDGp3TUeKqm6N0tOeJJy3I5z6YV+yqE871LIVtkk651UtrWm7+ge X3rFjBIJrjtZQH8rXXV7+GMd0bHLuCHcKbSyUJuady0JqV35XUsWkoFBJJgw9rlgZLws hIW22/5j1dA6QiFba2JJDBuZoyy53OwAQlo7e2f/ny4gaxmMhe4iF9Bstkc2ClzDqzrS UoXVYugtVKzqq+eNtZ/41kRVUGH40v4ScOfm52iOlOVYrpK6IPtNAuOwG8xsqeCzd8qH yoXQ== X-Forwarded-Encrypted: i=1; AJvYcCWdB/XLJIXK0KGMtY2tGMozvxwFRvyY0nCVUUP6FlLfhskmYafBuYcpTMzRveDRW2PxOwcYKBk9LuPOYL8=@vger.kernel.org X-Gm-Message-State: AOJu0YxaaaxxYGLvvHnn/xNDlR+Z0JdV4TcmeiYGSyXhmpg/uAbP5yfj Izju/5aopVRHvkyrigheT7CdX/o0r7Xw5/HP/KpMVjLGLXxjH3T77e9z6QLxN/jtKW+ieVZkTkS YieKoIw== X-Google-Smtp-Source: AGHT+IHrP+gNzMM0bJGFQ5SmOunsCd7mUcGOxVV1Eo3YK/EY8tTSxm3Z3CJxK6pfja0X7wolDgSE2V+U1UGd X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c8f1:280e:e8ad:32]) (user=irogers job=sendgmr) by 2002:a05:690c:20a5:b0:6e9:f188:8638 with SMTP id 00721157ae682-6efe3c82d3bmr626347b3.7.1733783295157; Mon, 09 Dec 2024 14:28:15 -0800 (PST) Date: Mon, 9 Dec 2024 14:27:40 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-4-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241209222800.296000-1-irogers@google.com> X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog Subject: [PATCH v1 03/22] perf vendor events: Add Arrowlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add events v1.05. Add TMA metrics based on v5.01. 
Bring in the events from: https://github.com/intel/perfmon/tree/main/ARL/events TMA 5.01 is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/arrowlake/arl-metrics.json | 4597 +++++++++++++++++ .../pmu-events/arch/x86/arrowlake/cache.json | 1444 ++++++ .../arch/x86/arrowlake/floating-point.json | 532 ++ .../arch/x86/arrowlake/frontend.json | 609 +++ .../pmu-events/arch/x86/arrowlake/memory.json | 387 ++ .../arch/x86/arrowlake/metricgroups.json | 150 + .../pmu-events/arch/x86/arrowlake/other.json | 279 + .../arch/x86/arrowlake/pipeline.json | 2298 ++++++++ .../arch/x86/arrowlake/uncore-cache.json | 20 + .../x86/arrowlake/uncore-interconnect.json | 47 + .../arch/x86/arrowlake/uncore-memory.json | 142 + .../arch/x86/arrowlake/uncore-other.json | 10 + .../arch/x86/arrowlake/virtual-memory.json | 522 ++ tools/perf/pmu-events/arch/x86/mapfile.csv | 1 + 14 files changed, 11038 insertions(+) create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.js= on create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/cache.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/floating-point= .json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/frontend.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/memory.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.j= son create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/other.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.j= son create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-interco= nnect.json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.= json create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.j= son create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory= .json diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json b/to= ols/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json new file mode 100644 index 000000000000..7d0688a8edd8 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json @@ -0,0 +1,4597 @@ +[ + { + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + 
"ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_allocation_restriction", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE= / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serial= izing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_m= icrocode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microc= ode_sequencer) * tma_heavy_operations)", + 
"MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instr= uctions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / 
tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_bwd_mispredicts", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_fwd_mispredicts", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_L= OAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * t= ma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 <= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.= XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_cor= e_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= ) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound 
>= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_contested_accesses, tma= _machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(cpu@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ + = IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_= DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. 
The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivers higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties; hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints at approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers a suboptimal amount of uops to the Backend. Rel= ated metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_ipt= b, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. 
For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or more = uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or more= uops. This highly-correlates with the number of uops in such instructions", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0x30@ = / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible for fetching operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions, where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions, where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations, instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations, instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences. ([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipflop", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX 256-bit in= struction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / 
(cpu_atom@FP_INST_RETI= RED.256B_DP@ + cpu_atom@FP_INST_RETIRED.256B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx256", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / 
BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Ifetch", + "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", + "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. 
See Info.Ifetch_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_ato= m@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Load_Store_Miss", + "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", + "MetricGroup": "Mem_Exec", + "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricName": "tma_info_br_inst_mix_ipcall", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between 
Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are backward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / 
tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / cpu_atom@INST_RETI= RED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": 
"cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ - = cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@) / cpu_atom@MEM_BOUND_STALLS_IFET= CH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. 
Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses in the L2",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L3",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses the L3",
+        "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ - cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@) / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a pipeline block",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_l1_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_load_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full",
+        "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_atom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_store_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory disambiguation",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to floating point assists",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assist_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory ordering",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory renaming",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_ipload",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Store",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_STORES@",
+        "MetricName": "tma_info_mem_mix_ipstore",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that perform one or more locks",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_locks_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that are splits",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_splits_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Ratio of mem load uops to all uops",
+        "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_mem_mix_memload_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_fb_hpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the Level 0 within L1D cache [GB / sec]",
+        "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l1dl0_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l1mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l1mpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l2_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)",
+        "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2hpki_all",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2hpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "Backend;CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric;Offcore",
+        "MetricName": "tma_info_memory_l2mpki_all",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2mpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Offcore requests (L2 cache miss) per kilo instruction for demand RFOs",
+        "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheMisses;Metric;Offcore",
+        "MetricName": "tma_info_memory_l2mpki_rfo",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric;Offcore",
+        "MetricName": "tma_info_memory_l3_cache_access_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l3_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_l3mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Parallel L2 cache miss data reads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
+        "MetricGroup": "Memory_BW;Metric;Offcore",
+        "MetricName": "tma_info_memory_latency_data_l2_mlp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Latency for L2 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
+        "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l2_miss_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Parallel L2 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
+        "MetricGroup": "Memory_BW;Metric;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l2_mlp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Latency for L3 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+        "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l3_miss_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
+        "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD",
+        "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat",
+        "MetricName": "tma_info_memory_load_miss_real_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "\"Bus lock\" per kilo instruction",
+        "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_bus_lock_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Un-cacheable retired load per kilo instruction",
+        "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_uc_load_pki",
+        "Unit": "cpu_atom"
+    },
+    {
miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Metric;Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Core_Metric;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= 
ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Metric;Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "Metric;MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", + "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricName": "tma_info_serialization _%_tpause_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait;Metric", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary;System_Metric", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Metric;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Metric;Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Flops", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch (Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;Inst_Metric;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "Metric;OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Clocks;Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC;System_Metric",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Socket actual clocks when any core is active on that socket",
+        "MetricExpr": "UNC_CLOCK.SOCKET",
+        "MetricGroup": "Count;SoC",
+        "MetricName": "tma_info_system_socket_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Seconds;Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative to nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Count;Pipeline",
+        "MetricName": "tma_info_thread_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Metric;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Metric;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggests high rate of \"execute\" at rename stage",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Metric;Ret;Summary",
+        "MetricName": "tma_info_thread_ipc",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots",
+        "MetricgroupNoGroup": "TopdownL1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Metric;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW;Metric",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are FPDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are IDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are microcode ops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are x87 uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit Level 1 after missing Level 0 within the L1D cache",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_capacity",
+        "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: DEPENDENT_LOADS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RETIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to load accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to load accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RETIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED.L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group;tma_overlap",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.LOAD",
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to 
lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma= _issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / tma_i= nfo_thread_clks", + "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_l1_bound= , tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_gro= up;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group= ;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_op= erations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4= _group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ /= tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (I= DQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_inf= o_thread_clks", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_g= roup;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_port= s_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks", + "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tm= a_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", + "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L= 3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_iss= ueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). 
Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_oth= er_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_AR= ITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * t= ma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI= _256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tm= a_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", + "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tm= a_branch_mispredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", + "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;t= ma_machine_clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", + "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists= _group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_= PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread= _clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUN= D_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_= 3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_b= ound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_predecode",
+        "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_register",
+        "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_reorder_buffer",
+        "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@) - tma_core_bound",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_resource_bound",
+        "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bound >0.10))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
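The four tma_ports_utilized_* leaves above partition core-bound cycles by
how many uops executed per cycle. A small Python sketch of that bucketing,
with made-up event counts (names mirror the JSON expressions):

clks = 1_000_000  # tma_info_thread_clks
counts = {
    "EXE_ACTIVITY.EXE_BOUND_0_PORTS": 150_000,  # cycles with 0 uops
    "EXE_ACTIVITY.1_PORTS_UTIL": 200_000,       # cycles with exactly 1
    "EXE_ACTIVITY.2_PORTS_UTIL": 120_000,       # cycles with exactly 2
    "UOPS_EXECUTED.CYCLES_GE_3": 300_000,       # cycles with 3 or more
}
for suffix, ev in [("0", "EXE_ACTIVITY.EXE_BOUND_0_PORTS"),
                   ("1", "EXE_ACTIVITY.1_PORTS_UTIL"),
                   ("2", "EXE_ACTIVITY.2_PORTS_UTIL"),
                   ("3m", "UOPS_EXECUTED.CYCLES_GE_3")]:
    print(f"tma_ports_utilized_{suffix}: {counts[ev] / clks:.1%}")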
"tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bo= und_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueS= pSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_cl= ks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
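What makes an access a "split" load or store in the two metrics above is
simply that it crosses a 64-byte cache line boundary. A minimal sketch of
the boundary check, with illustrative addresses:

CACHE_LINE = 64

def is_split(addr: int, size: int) -> bool:
    # The access spans two lines when its first and last byte fall
    # into different 64-byte-aligned blocks.
    return (addr % CACHE_LINE) + size > CACHE_LINE

print(is_split(0x1000, 8))  # False: fully inside one line
print(is_split(0x103c, 8))  # True: bytes 0x103c..0x1043 cross 0x1040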
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually have less impact on out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
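The tma_store_stlb_miss_{4k,2m,1g} leaves here split the parent walk-time
fraction by the page size of completed walks. A short sketch of that
attribution, with made-up walk counts:

# DTLB_STORE_MISSES.WALK_COMPLETED_* counts (illustrative)
walks = {"4K": 9_000, "2M_4M": 800, "1G": 200}
tma_store_stlb_miss = 0.08  # parent metric: fraction of cycles walking
total = sum(walks.values())
for size, n in walks.items():
    share = tma_store_stlb_miss * n / total
    print(f"store STLB walk cycles attributed to {size} pages: {share:.2%}")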
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Uncore frequency per die [GHZ]",
+        "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
+        "MetricGroup": "SoC",
+        "MetricName": "UNCORE_FREQ",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_alu_op_utilization",
+        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
+        "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_assists",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handling SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_avx_assists",
+        "MetricThreshold": "tma_avx_assists > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
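The UNCORE_FREQ expression above is plain arithmetic over three inputs. A
minimal sketch with illustrative values (in the JSON they come from the
uncore clock counts, the #num_dies literal and the measured wall-clock
time):

socket_clks = 3.6e9   # tma_info_system_socket_clks over the run
num_dies = 1
duration_time = 1.0   # seconds
uncore_freq_ghz = socket_clks / num_dies / duration_time / 1e9
print(f"UNCORE_FREQ = {uncore_freq_ghz:.2f} GHz")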
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches is categorized under the Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_capacity / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_core"
+    },
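The two large bottleneck expressions above repeat one pattern: a leaf's
cost is its share at each level of the TMA tree, scaled to percent. A
simplified one-level sketch of that weighting (values are illustrative,
and the real expressions chain two such levels):

def weighted_cost(parent_fraction, leaf, siblings):
    # leaf and siblings are same-level TMA fractions; the leaf's
    # contribution is parent * leaf / sum(children at that level).
    return parent_fraction * leaf / sum(siblings)

tma_memory_bound = 0.30
l3_children = {"contested": 0.02, "data_sharing": 0.03,
               "l3_hit_latency": 0.08, "sq_full": 0.05}
cost = 100 * weighted_cost(tma_memory_bound, l3_children["sq_full"],
                           l3_children.values())
print(f"SQ-full share of the bandwidth bottleneck: {cost:.1f}%")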
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls.",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
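As the tma_bottleneck_other_bottlenecks expression above shows, that
metric is simply the residual after all named Bottlenecks-view costs are
subtracted from 100%. A sketch with illustrative percentages:

named = {"big_code": 4.0, "instruction_fetch_bw": 6.0,
         "mispredictions": 12.0, "cache_memory_bandwidth": 9.0,
         "cache_memory_latency": 11.0, "memory_data_tlbs": 2.0,
         "memory_synchronization": 1.5, "compute_bound_est": 8.0,
         "irregular_overhead": 3.0, "branching_overhead": 2.5,
         "useful_work": 35.0}  # made-up values for each named bottleneck
other = 100 - sum(named.values())
print(f"tma_bottleneck_other_bottlenecks = {other:.1f}%")  # flagged > 20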
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRED.L2_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by non-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_nt_mispredicts",
+        "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by backward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_bwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by forward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_fwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
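The BR_MISP_RETIRED.*_COST expressions above all follow one pattern:
retired-event count times its sampled average cost (the @R modifier),
normalized by thread clocks. A sketch of that arithmetic, with made-up
numbers:

def mispredict_fraction(count, avg_cost_cycles, thread_clks):
    # count * average cost approximates total stall cycles attributed
    # to this class of retired mispredicted branches.
    return count * avg_cost_cycles / thread_clks

clks = 10_000_000
frac = mispredict_fraction(20_000, 30.0, clks)
print(f"{frac:.1%} of cycles on non-taken conditional mispredicts")
# flagged when > 5% per the MetricThreshold above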
synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_machine_clears",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(cpu@IDQ.DSB_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ + IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivers higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
+        "MetricName": "tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_dtlb_store",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
+        "MetricName": "tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
+        "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricName": "tma_false_sharing",
+        "MetricThreshold": "(tma_false_sharing > 0.05) & ((tma_store_bound > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
+        "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricName": "tma_fb_full",
+        "MetricThreshold": "tma_fb_full > 0.3",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+        "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricName": "tma_fetch_bandwidth",
+        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
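tma_false_sharing above is a fixed-latency cost model: each OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM event is charged a per-event cost of 28 scaled by the measured core frequency, and the total is normalized to thread clocks. A sketch of that shape, with invented counter values:

# Fixed-latency cost model mirroring tma_false_sharing above.
# Event counts and frequencies are invented sample values.
core_freq_ghz = 3.0          # tma_info_system_core_frequency
hitm_rfos = 20e6             # OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM
thread_clks = 30e9           # CPU_CLK_UNHALTED.THREAD

# 28 * core-frequency cycles estimated per HITM snoop, as in the expression.
est_cycles = 28 * core_freq_ghz * hitm_rfos
tma_false_sharing = est_cycles / thread_clks
print(f"tma_false_sharing = {tma_false_sharing:.2%}")  # -> 5.60%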
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_scalar",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_vector",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_128b",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=0x30@ / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_256b",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_frontend_bound",
+        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fused_instructions",
+        "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_heavy_operations",
+        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
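The tma_fp_* family above shares one shape: a retired-uop count divided by the retired-slot total, tma_retiring * tma_info_thread_slots, so each value is the share of useful slots spent in that uop class. A sketch with invented counts; the FMA caveat in the descriptions means a real reading can overcount, since one FMA uop carries two FLOPs:

# Share of retired slots per FP uop class, mirroring the
# tma_fp_scalar / tma_fp_vector expressions above. Counts are invented.
slots = 40e9
retiring = 0.45                    # tma_retiring fraction
retired_slots = retiring * slots   # denominator used by all tma_fp_* nodes

fp_scalar_uops = 1.8e9             # FP_ARITH_INST_RETIRED.SCALAR
fp_vector_uops = 3.6e9             # FP_ARITH_INST_RETIRED.VECTOR

tma_fp_scalar = fp_scalar_uops / retired_slots
tma_fp_vector = fp_vector_uops / retired_slots
print(f"fp_scalar={tma_fp_scalar:.2%} fp_vector={tma_fp_vector:.2%}")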
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_other_branches", + "Unit": 
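The Info.Branches group above forms a partition: the five named fractions each divide BR_INST_RETIRED.ALL_BRANCHES, and other_branches is defined as the remainder so the six always sum to 1. A sketch of that identity, with invented event counts:

# Branch-mix partition used by the tma_info_branches_* metrics above.
# Event counts are invented; each fraction divides ALL_BRANCHES.
all_branches = 10.0e9
frac = {
    "cond_nt":     4.1e9 / all_branches,  # BR_INST_RETIRED.COND_NTAKEN
    "cond_tk_bwd": 1.9e9 / all_branches,  # BR_INST_RETIRED.COND_TAKEN_BWD
    "cond_tk_fwd": 1.2e9 / all_branches,  # BR_INST_RETIRED.COND_TAKEN_FWD
    "callret":     1.6e9 / all_branches,  # NEAR_CALL + NEAR_RETURN
    "jump":        0.9e9 / all_branches,  # unconditional jumps
}
frac["other_branches"] = 1.0 - sum(frac.values())  # remainder, as in the JSON
assert abs(sum(frac.values()) - 1.0) < 1e-9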
"cpu_core" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": 
"BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). 
Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + 
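Most of the Info.Inst_Mix entries above are the same ratio with a different denominator: INST_RETIRED.ANY divided by some event, with a '<' threshold because a lower quotient means the event occurs more often. A small helper makes the pattern explicit (event counts invented):

# "Instructions per X" pattern shared by the tma_info_inst_mix_* metrics.
# Lower values mean X occurs more often; thresholds therefore use '<'.
def inst_per(instructions: float, event_count: float) -> float:
    return instructions / event_count

inst_retired = 50e9
ipbranch = inst_per(inst_retired, 9.0e9)   # BR_INST_RETIRED.ALL_BRANCHES
ipload = inst_per(inst_retired, 14.0e9)    # MEM_INST_RETIRED.ALL_LOADS
print(ipbranch < 8, ipload < 3)  # True flags the metric's threshold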
"MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * 
L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + 
"MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number 
+    {
+        "BriefDescription": "Average number of uops fetched from MITE per cycle",
+        "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average CPU Utilization (percentage)",
+        "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online",
+        "MetricGroup": "HPC;Summary",
+        "MetricName": "tma_info_system_cpu_utilization",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average number of utilized CPUs",
+        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_cpus_utilized",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Giga Floating Point Operations Per Second",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
+        "MetricGroup": "Cor;Flops;HPC",
+        "MetricName": "tma_info_system_gflops",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Socket actual clocks when any core is active on that socket",
+        "MetricExpr": "UNC_CLOCK.SOCKET",
+        "MetricGroup": "SoC",
+        "MetricName": "tma_info_system_socket_clks",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative to nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Pipeline",
+        "MetricName": "tma_info_thread_clks",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggests high rate of \"execute\" at rename stage",
+        "Unit": "cpu_core"
+    },
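The GFLOPS expression above encodes FLOPs-per-event weights: a scalar op counts as 1 FLOP, a 128-bit packed-double op as 2, the 4_FLOPS umask group as 4, and a 256-bit packed-single op as 8. A sketch of the weighting, with invented event counts:

# FLOP weighting behind tma_info_system_gflops above; counts are invented.
weights = {
    "SCALAR": 1,              # FP_ARITH_INST_RETIRED.SCALAR
    "128B_PACKED_DOUBLE": 2,  # 2 doubles per 128-bit op
    "4_FLOPS": 4,             # umask group counting 4-FLOP ops
    "256B_PACKED_SINGLE": 8,  # 8 singles per 256-bit op
}
counts = {"SCALAR": 1.0e9, "128B_PACKED_DOUBLE": 0.3e9,
          "4_FLOPS": 0.5e9, "256B_PACKED_SINGLE": 0.2e9}
wall_time_s = 10.0
gflops = sum(weights[k] * counts[k] for k in weights) / 1e9 / wall_time_s
print(f"{gflops:.2f} GFLOPS")  # -> 0.52 GFLOPS

The same weighted sum divided by clocks rather than seconds gives tma_info_core_flopc earlier in the file.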
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD= _RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_L= OAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thr= ead_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", + "MetricName": "tma_l1_latency_capacity", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: DEPENDENT_LOADS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
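The tma_l1_latency_capacity and tma_l2_hit_latency expressions above share a guard-and-fallback shape: use the event's @R term when it is available, otherwise fall back to a fixed per-hit cost. A minimal Python sketch of that selection logic, with hypothetical inputs (the function name and numbers are illustrative only):

  def latency_cost(hit_count, retire_latency, fixed_cost):
      # MetricExpr shape: (min(EV * EV@R, EV * K) if 0 < EV@R else EV * K)
      if retire_latency > 0:
          return min(hit_count * retire_latency, hit_count * fixed_cost)
      return hit_count * fixed_cost

  clks = 5.0e8
  # e.g. MEM_LOAD_RETIRED.L1_HIT_L1 with K = 9 cycles when @R is unavailable
  frac = latency_cost(hit_count=2.0e6, retire_latency=0.0, fixed_cost=9) / clks
  print(f"tma_l1_latency_capacity ~ {frac:.4f}")  # compared against the 0.1 threshold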
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ranch_resteers, tma_mem_latency, tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; a high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= 
IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / tma_i= nfo_thread_clks", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_l1_bound= , tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ /= tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (I= DQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_inf= o_thread_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_AR= ITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * t= ma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI= _256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tm= a_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
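tma_other_light_ops above (like tma_machine_clears earlier) is a leftover bucket: the parent minus its named children, clamped at zero so measurement noise cannot drive the residual negative. A small Python sketch of that pattern, with made-up values:

  def leftover(parent, children):
      # MetricExpr shape: max(0, parent - sum of named sibling nodes)
      return max(0.0, parent - sum(children))

  tma_light_operations = 0.55
  named = [0.12, 0.08, 0.20, 0.05]  # e.g. x87, FP arith, int vector, memory ops
  tma_other_light_ops = leftover(tma_light_operations, named)
  print(f"{tma_other_light_ops:.2f}")  # 0.10: below the 0.3 threshold, not flagged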
May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. 
For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. 
Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
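The MetricThreshold strings above chain a node's own cutoff with its ancestors' cutoffs, so a leaf such as tma_split_stores is only flagged when the whole path from tma_backend_bound down is significant. A sketch of that gating in Python, with hypothetical metric values:

  def flagged(values, node, cutoff, gates):
      # gates: (ancestor_name, ancestor_cutoff) pairs from the threshold string
      return values[node] > cutoff and all(values[a] > c for a, c in gates)

  values = {"tma_split_stores": 0.25, "tma_store_bound": 0.3,
            "tma_memory_bound": 0.35, "tma_backend_bound": 0.4}
  print(flagged(values, "tma_split_stores", 0.2,
                [("tma_store_bound", 0.2), ("tma_memory_bound", 0.2),
                 ("tma_backend_bound", 0.2)]))  # True: worth investigating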
Sample with: MEM_INST_RETIRED.SPLIT_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. 
For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually have less impact on out-of-order core performance; however; holding resources for a longer time can lead to undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. 
Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a 
read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/cache.json b/tools/perf/pmu-events/arch/x86/arrowlake/cache.json
new file mode 100644
index 000000000000..8eb08ee6404b
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/cache.json
@@ -0,0 +1,1444 @@
+[
+    {
+        "BriefDescription": "Counts the number of requests that were not accepted into the L2Q because the L2Q is FULL.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x31",
+        "EventName": "CORE_REJECT_L2Q.ANY",
+        "PublicDescription": "Counts the number of (demand and L1 prefetchers) core requests rejected by the L2Q due to a full or nearly full condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to ensure fairness between cores, or to delay a core's dirty eviction when the address conflicts with incoming external snoops. 
(Note that L2 prefetcher requests that are dropped are no= t counted by this event.)", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines replaced in = L0 data cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.L0_REPLACEMENT", + "PublicDescription": "Counts L0 data line replacements including o= pportunistic replacements, and replacements that require stall-for-replace = or block-for-replace.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D Fill Buffer (FB) unavailability.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.FB_FULL", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D Fill Buffer (FB) unavailability. Demand requests include= cacheable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D due to lack of L2 resources.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.L2_STALLS", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D due to lack of L2 resources. Demand requests include cac= heable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of demand requests that missed L1D cac= he", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.LOAD", + "PublicDescription": "Count occurrences (rising-edge) of DCACHE_PE= NDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increas= es* (at FB alloc or promotion).", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of L1D misses that are outstanding", + "Counter": "2", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD", + "PublicDescription": "Counts number of L1D misses that are outstan= ding in each cycle, that is each cycle the number of Fill Buffers (FB) outs= tanding required by Demand Reads. FB either is held by demand loads, or it = is held by non-demand loads and gets hit at least once by demand. The valid= outstanding interval is defined until the FB deallocation by one of the fo= llowing ways: from FB allocation, if FB is allocated by demand from the dem= and Hit FB, if it is allocated by hardware or software prefetch. Note: In t= he L1D, a Demand Read contains cacheable or noncacheable demand loads, incl= uding ones causing cache-line splits and reads due to page walks resulted f= rom any request type.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with L1D load Misses outstanding.", + "Counter": "2", + "CounterMask": "1", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD_CYCLES", + "PublicDescription": "Counts duration of L1D miss outstanding in c= ycles.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache lines filling L2", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.ALL", + "PublicDescription": "Counts the number of L2 cache lines filling = the L2. 
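The L1D_PENDING.LOAD occupancy event above can be combined with its cycles and occurrence counterparts in the usual occupancy-over-occurrences way to derive averages. A sketch with made-up counts; the derived names are illustrations, not perf metrics:

  # L1D_PENDING.LOAD sums outstanding Fill Buffers each cycle (an occupancy),
  # L1D_PENDING.LOAD_CYCLES counts cycles with at least one outstanding, and
  # L1D_MISS.LOAD counts rising-edge occurrences, so simple ratios give averages.
  pending = 4.8e8   # L1D_PENDING.LOAD
  cycles  = 1.2e8   # L1D_PENDING.LOAD_CYCLES
  misses  = 6.0e6   # L1D_MISS.LOAD

  avg_occupancy = pending / cycles  # ~4 FBs busy while any miss is in flight
  avg_duration  = pending / misses  # ~80 cycles outstanding per demand miss
  print(avg_occupancy, avg_duration)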
Counting does not cover rejects.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Modified cache lines that are evicted by L2 c= ache when triggered by an L2 cache fill.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of lines that are evicted = by L2 cache when triggered by an L2 cache fill. Those lines are in Modified= state. Modified lines are written back to L3", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cache lines that have been L2 hardware prefet= ched but not used by demand accesses", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of cache lines that have b= een prefetched by the L2 hardware prefetcher but not used by demand access = when evicted from the L2 cache", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of demand and prefetch tran= sactions that the External Queue (XQ) rejects due to a full or near full co= ndition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x30", + "EventName": "L2_REJECT_XQ.ANY", + "PublicDescription": "Counts the number of demand and prefetch tra= nsactions that the External Queue (XQ) rejects due to a full or near full c= ondition which likely indicates back pressure from the IDI link. The XQ ma= y reject transactions from the L2Q (non-cacheable requests), BBL (L2 misses= ) and WOB (L2 write-back victims).", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.ALL", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_RQSTS.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_RQSTS.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read access L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD", + "PublicDescription": "Counts Demand Data Read requests accessing t= he L2 cache. These requests may hit or miss L2 cache. True-miss exclude mis= ses that were merged with ongoing L2 misses. 
An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0xe1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ANY", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache misses when fetching instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.CODE_RD_MISS", + "PublicDescription": "Counts L2 cache misses when fetching instruc= tions.", + "SampleAfterValue": "200003", + "UMask": "0x24", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests that hit L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT", + "PublicDescription": "Counts the number of demand Data Read reques= ts initiated by load instructions that hit L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS", + "PublicDescription": "Counts demand Data Read requests with true-m= iss in the L2 cache. True-miss excludes misses that were merged with ongoin= g L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_REQUEST.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_REQUEST.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.REFERENCES", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. 
[This event is alias to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "RFO requests that miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.RFO_MISS", + "PublicDescription": "Counts the RFO (Read-for-Ownership) requests= that miss L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x22", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when L1D is locked", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x42", + "EventName": "LOCK_CYCLES.CACHE_LOCK_DURATION", + "PublicDescription": "This event counts the number of cycles when = the L1D is locked.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x4f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = core has access to an L3 cache, the LLC is the L3 cache, otherwise it is th= e L2 cache. Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x4f", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stal= led due to an instruction cache or Translation Lookaside Buffer (TLB) miss = which hit in the L2 cache. 
Includes L2 Hit resulting from and L1D eviction= of another core in the same module which is longer latency than a typical = L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stal= led due to an instruction cache or Translation Lookaside Buffer (TLB) miss = which hit in the L2 cache.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an icache or itlb miss which hit in the LLC.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an L1 demand load miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalle= d due to a demand load which hit in the L2 cache. 
Includes L2 Hit resultin= g from and L1D eviction of another core in the same module which is longer = latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalle= d due to a demand load which hit in the L2 cache.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts all retired load instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_LOADS", + "PublicDescription": "Counts Instructions with at least one archit= ecturally visible load retired.", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_STORES", + "PublicDescription": "Counts all retired store instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired software prefetch instructions.", + "Counter": "0,1,2,3", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_SWPF", + "PublicDescription": "Counts all retired software prefetch instruc= tions.", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All retired memory instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ANY", + "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", + "SampleAfterValue": "1000003", + "UMask": "0x87", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with locked access.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.LOCK_LOADS", + "PublicDescription": "Counts retired load instructions with locked= access.", + "SampleAfterValue": "100007", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that split across a= cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "Counts retired load instructions 
that split = across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that split across = a cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_STORES", + "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that hit the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", + "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that hit the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", + "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0xa", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that miss the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", + "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x11", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that miss the STLB= .", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", + "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x12", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were a cross-core Snoop hits and forwards data from an in on-package core c= ache (induced by NI$)", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were a cross-core Snoop hits and forwards data from an in on-packag= e core cache (induced by NI$)", + "SampleAfterValue": "20011", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were HitM responses from shared L3, Hit-with-FWD is normally excluded.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", + "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3, Hit-with-FWD is normally exclud= ed.", + "SampleAfterValue": "20011", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were L3 hit and cross-core snoop missed in on-pkg core cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + 
"BriefDescription": "Retired load instructions whose data sources = were L3 and cross-core snoop hits in on-pkg core cache", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions with at least 1 uncachea= ble load or lock.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd4", + "EventName": "MEM_LOAD_MISC_RETIRED.UC", + "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", + "SampleAfterValue": "100007", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of completed demand load requests that= missed the L1, but hit the FB(fill buffer), because a preceding miss to th= e same cacheline initiated the line to be brought into L1, but data is not = yet ready in L1.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.FB_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L1 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. 
This event includes all SW prefet= ches and lock instructions regardless of the data source.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts retired load instructions with at leas= t one uop that hit in the Level 1 of the L1 data cache.", + "Counter": "0,1,2,3", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT_L1", + "SampleAfterValue": "1000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L1 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L2 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_HIT", + "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L2 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_MISS", + "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", + "SampleAfterValue": "100021", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L3 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", + "SampleAfterValue": "100021", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L3 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", + "SampleAfterValue": "50021", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t the L1 data cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L1 data cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "200003", + "UMask": "0x40", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 
cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "PublicDescription": "Counts the number of load ops retired that h= it in the L2 cache. Includes L2 Hit resulting from and L1D eviction of ano= ther core in the same module which is longer latency than a typical L2 hit.= ", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1c", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "20003", + "UMask": "0x7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to load buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to a load buffer full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to RSV full", + 
"Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to an RSV full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to store buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to a store buffer full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "20003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "MEM_STORE_RETIRED.L2_HIT", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of load uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x81", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x81", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of store uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_STORES", + "SampleAfterValue": "200003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_STORES", + "SampleAfterValue": "200003", + "UMask": "0x82", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_1024", + "MSRIndex": "0x3F6", + "MSRValue": "0x400", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + 
"SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_2048", + "MSRIndex": "0x3F6", + "MSRValue": "0x800", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the 
number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load uops retired that p= erformed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load uops retired that p= erformed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": 
"cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired split load uops.= ", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split load uops.= ", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired split store uops= .", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split store uops= .", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the second Level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of store uops retired that = miss in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", + "SampleAfterValue": "200003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Retired memory uops for any access", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe5", + "EventName": "MEM_UOP_RETIRED.ANY", + "PublicDescription": "Number of retired micro-operations (uops) fo= r load or store memory accesses", + "SampleAfterValue": "1000003", + "UMask": "0xf", 
+ "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches which forwarded th= e unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x20001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Any memory transaction that reached the SQ.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS", + "PublicDescription": "Counts memory transactions reached the super= queue including requests initiated by the core, all L3 prefetches, page wa= lks, etc..", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand and prefetch data reads", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DATA_RD", + "PublicDescription": "Counts the demand and prefetch data reads. A= ll Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 p= refetchers). Counting also covers reads due to page walks resulted from any= request type.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cacheable and Non-Cacheable code read request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", + "PublicDescription": "Counts both cacheable and Non-Cacheable code= read requests.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests sent to uncore", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD", + "PublicDescription": "Counts the Demand Data Read requests sent to= uncore. 
Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determi= ne average latency in the uncore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand RFO requests including regular RFOs, l= ocks, ItoM", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_RFO", + "PublicDescription": "Counts the demand RFO (read for ownership) r= equests including regular RFOs, locks, ItoM.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when offcore outstanding cacheable Cor= e Data Read transactions are present in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "PublicDescription": "Counts cycles when offcore outstanding cache= able Core Data Read transactions are present in the super queue. A transact= ion is considered to be in the Offcore outstanding state between L2 miss an= d transaction completion sent to requestor (SQ de-allocation). See correspo= nding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding Code Reads tr= ansactions in the SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 outstanding demand da= ta read request is pending.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA= _RD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding demand rfo re= ads transactions in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO", + "PublicDescription": "Counts the number of offcore outstanding dem= and rfo Reads transactions in the super queue every cycle. The 'Offcore out= standing' state of the transaction lasts from the L2 miss until the sending= transaction completion to requestor (SQ deallocation). See the correspondi= ng Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding cacheable Core Data Read = transactions in SuperQueue (SQ), queue to uncore", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "PublicDescription": "Counts the number of offcore outstanding cac= heable Core Data Read transactions in the super queue every cycle. A transa= ction is considered to be in the Offcore outstanding state between L2 miss = and transaction completion sent to requestor (SQ de-allocation). 
See corres= ponding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding Code Reads transactions i= n the SuperQueue (SQ), queue to uncore, every cycle.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = outstanding demand data read requests pending.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= outstanding demand data read requests pending. Requests are considered o= utstanding from the time they miss the core's L2 cache until the transactio= n completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Store Read transactions pending for off-core.= Highly correlated.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO", + "PublicDescription": "Counts the number of off-core outstanding re= ad-for-ownership (RFO) store transactions every cycle. An RFO transaction i= s considered to be in the Off-core outstanding state between L2 cache miss = and transaction completion.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts bus locks, accounts for cache line spl= it locks and UC locks.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to= enforce cache coherency for certain memory accesses that need to be done a= tomically. 
+    {
+        "BriefDescription": "Counts bus locks, accounts for cache line split locks and UC locks.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x2c",
+        "EventName": "SQ_MISC.BUS_LOCK",
+        "PublicDescription": "Counts the more expensive bus lock needed to enforce cache coherency for certain memory accesses that need to be done atomically. Can be created by issuing an atomic instruction (via the LOCK prefix) which causes a cache line split or accesses uncacheable memory.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of PREFETCHNTA, PREFETCHW, PREFETCHT0, PREFETCHT1 or PREFETCHT2 instructions executed.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x40",
+        "EventName": "SW_PREFETCH_ACCESS.ANY",
+        "SampleAfterValue": "100003",
+        "UMask": "0xf",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of PREFETCHNTA instructions executed.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x40",
+        "EventName": "SW_PREFETCH_ACCESS.NTA",
+        "PublicDescription": "Counts the number of PREFETCHNTA instructions executed.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of PREFETCHW instructions executed.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x40",
+        "EventName": "SW_PREFETCH_ACCESS.PREFETCHW",
+        "PublicDescription": "Counts the number of PREFETCHW instructions executed.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of PREFETCHT0 instructions executed.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x40",
+        "EventName": "SW_PREFETCH_ACCESS.T0",
+        "PublicDescription": "Counts the number of PREFETCHT0 instructions executed.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of PREFETCHT1 or PREFETCHT2 instructions executed.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x40",
+        "EventName": "SW_PREFETCH_ACCESS.T1_T2",
+        "PublicDescription": "Counts the number of PREFETCHT1 or PREFETCHT2 instructions executed.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots every cycle that were not delivered by the frontend due to an icache miss",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x71",
+        "EventName": "TOPDOWN_FE_BOUND.ICACHE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots every cycle that were not delivered by the frontend due to an icache miss",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x71",
+        "EventName": "TOPDOWN_FE_BOUND.ICACHE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_lowpower"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json b/tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json
new file mode 100644
index 000000000000..602a6b90a386
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json
@@ -0,0 +1,532 @@
+[
+    {
+        "BriefDescription": "Cycles when floating-point divide unit is busy executing divide or square root operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0xb0",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "PublicDescription": "Counts cycles when divide unit is busy executing divide or square root operations. Accounts for floating-point operations only.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles when any of the floating point dividers are active.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts all microcode FP assists.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.FP",
+        "PublicDescription": "Counts all microcode Floating Point assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "ASSISTS.SSE_AVX_MIX",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.SSE_AVX_MIX",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 1st VEC port (port 0). FP-arith-uops are of type ADD*/SUB*/MUL/FMA*/DPP.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V0",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 2nd VEC port (port 1)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V1",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 3rd VEC port (port 5)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V2",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 4th VEC port",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V3",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.4_FLOPS",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS",
+        "SampleAfterValue": "100003",
+        "UMask": "0x18",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.VECTOR",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3c",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_128B [This event is alias to FP_ARITH_OPS_RETIRED.VECTOR_128B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR_128B",
+        "SampleAfterValue": "100003",
+        "UMask": "0xc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_256B [This event is alias to FP_ARITH_OPS_RETIRED.VECTOR_256B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR_256B",
+        "SampleAfterValue": "100003",
+        "UMask": "0x30",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below.
Each count represents 8 computation opera= tions, one for each element. Applies to SSE* and AVX* packed single precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count tw= ice as they perform 2 calculations per element. The DAZ and FTZ flags in th= e MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packe= d single and 256-bit packed double precision FP instructions retired; some = instructions will count twice as noted below. Each count represents 2 or/a= nd 4 computation operations, 1 for each element. Applies to SSE* and AVX* = packed single precision and packed double precision FP instructions: ADD SU= B HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DP= P and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed single precision and 256-bit packed double precision floating-point ins= tructions retired; some instructions will count twice as noted below. Each= count represents 2 or/and 4 computation operations, one for each element. = Applies to SSE* and AVX* packed single precision floating-point and packed= double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL= DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB ins= tructions count twice as they perform 2 calculations per element. The DAZ a= nd FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floati= ng-point instructions retired; some instructions will count twice as noted = below. Applies to SSE* and AVX* scalar, double and single precision floati= ng-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB= . DPP and FM(N)ADD/SUB instructions count twice as they perform multiple c= alculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar singl= e precision and double precision floating-point instructions retired; some = instructions will count twice as noted below. Each count represents 1 comp= utational operation. Applies to SSE* and AVX* scalar single precision float= ing-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB= . FM(N)ADD/SUB instructions count twice as they perform 2 calculations per= element. The DAZ and FTZ flags in the MXCSR register need to be set when u= sing these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational scalar= double precision floating-point instructions retired; some instructions wi= ll count twice as noted below. Each count represents 1 computational opera= tion. Applies to SSE* and AVX* scalar double precision floating-point instr= uctions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. 
FM(N)ADD/SUB instructi= ons count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational scalar doubl= e precision floating-point instructions retired; some instructions will cou= nt twice as noted below. Each count represents 1 computational operation. = Applies to SSE* and AVX* scalar double precision floating-point instruction= s: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions co= unt twice as they perform 2 calculations per element. The DAZ and FTZ flags= in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational scalar= single precision floating-point instructions retired; some instructions wi= ll count twice as noted below. Each count represents 1 computational opera= tion. Applies to SSE* and AVX* scalar single precision floating-point instr= uctions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB= instructions count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_SINGLE", + "PublicDescription": "Number of SSE/AVX computational scalar singl= e precision floating-point instructions retired; some instructions will cou= nt twice as noted below. Each count represents 1 computational operation. = Applies to SSE* and AVX* scalar single precision floating-point instruction= s: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instr= uctions count twice as they perform 2 calculations per element. The DAZ and= FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic in= structions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic i= nstructions. 
The DAZ and FTZ flags in the MXCSR register need to be set wh= en using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3c", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_128B [This event = is alias to FP_ARITH_INST_RETIRED.VECTOR_128B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_128B", + "SampleAfterValue": "100003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_256B [This event = is alias to FP_ARITH_INST_RETIRED.VECTOR_256B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_256B", + "SampleAfterValue": "100003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of all types of floating po= int operations per uop with all default weighting", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of all types of floating po= int operations per uop with all default weighting", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. [This event is alia= s to FP_FLOPS_RETIRED.FP64]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.DP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 32 bit single precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP32", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 32 bit single precision results [This event is alias to FP_F= LOPS_RETIRED.SP]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP32", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 64 bit double precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP64", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 64 bit double precision results [This event is alias to FP_F= LOPS_RETIRED.DP]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP64", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. [This event is alia= s to FP_FLOPS_RETIRED.FP32]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.SP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 128 bit double precision floating point. 
This may b= e SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of floating point re= tired instructions.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 128 bit single precision floating point. This may b= e SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 128 bit single precision floating point. This may b= e SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 256 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 256 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 256 bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a scalar 32bit single precision floating point", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.32B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a scalar 32bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.32B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a scalar 64 bit double precision floating point", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.64B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a scalar 64 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.64B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the total number of floating point re= tired instructions.", + "Counter": "0,1,2,3,4,5,6,7", + 
"EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on floatin= g point and vector integer port 0, 1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x1e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on floatin= g point and vector integer store data port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.STD", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operatio= ns retired that required microcode assist, which is not a reflection of the= number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operatio= ns retired that required microcode assist, which is not a reflection of the= number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point divide uo= ps retired (x87 and sse, including x87 sqrt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point divide uo= ps retired (x87 and sse, including x87 sqrt).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/frontend.json b/tools= /perf/pmu-events/arch/x86/arrowlake/frontend.json new file mode 100644 index 000000000000..fc5f4dd50fe6 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/frontend.json @@ -0,0 +1,609 @@ +[ + { + "BriefDescription": "Counts the total number of BACLEARS due to al= l branch types including conditional and unconditional jumps, returns, and = indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which o= ccur when the Branch Target Buffer (BTB) prediction or lack thereof, was co= rrected by a later branch predictor in the frontend. Includes BACLEARS due= to all branch types including conditional and unconditional jumps, returns= , and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Clears due to Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered w= hen it finds a branch instruction in a fetch line. 
This is called Unknown B= ranch which occurs for the first time a branch instruction is fetched or wh= en the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of BACLEARS due to al= l branch types including conditional and unconditional jumps, returns, and = indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which o= ccur when the Branch Target Buffer (BTB) prediction or lack thereof, was co= rrected by a later branch predictor in the frontend. Includes BACLEARS due= to all branch types including conditional and unconditional jumps, returns= , and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Stalls caused by changing prefix length of th= e instruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.LCP", + "PublicDescription": "Counts cycles that the Instruction Length de= coder (ILD) stalls occurred due to dynamically changing prefix length of th= e decoded instruction (by operand size prefix instruction 0x66, address siz= e prefix instruction 0x67 or REX.W for Intel64). Count is proportional to t= he number of prefixes in a 16B-line. This may result in a three-cycle penal= ty for each LCP (Length changing prefix) in a 16-byte chunk.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles the Microcode Sequencer is busy.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.MS_BUSY", + "SampleAfterValue": "500009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "DSB-to-MITE switch true penalty cycles.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x61", + "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", + "PublicDescription": "Decode Stream Buffer (DSB) is a Uop-cache th= at holds translations of previously fetched instructions that were decoded = by the legacy x86 decode pipeline (MITE). This event counts fetch penalty c= ycles when a transition occurs from DSB to MITE.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced DSB miss= .", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x1", + "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. 
the decoded instruction-cache) miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced a critic= al DSB miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x11", + "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ica= che miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired Instructions who experienced iTLB tru= e miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x14", + "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L1 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L1I_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x12", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L2 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L2_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x13", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 128 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", + "MSRIndex": "0x3F7", + "MSRValue": "0x608006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + 
"UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 16 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", + "MSRIndex": "0x3F7", + "MSRValue": "0x601006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions after front-end starvati= on of at least 2 cycles", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", + "MSRIndex": "0x3F7", + "MSRValue": "0x600206", + "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 256 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", + "MSRIndex": "0x3F7", + "MSRValue": "0x610006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end had at least 1 bubble-slot for a period of 2= cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", + "MSRIndex": "0x3F7", + "MSRValue": "0x100206", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 32 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", + "MSRIndex": "0x3F7", + "MSRValue": "0x602006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 4 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", + "MSRIndex": "0x3F7", + "MSRValue": "0x600406", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 512 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", + "MSRIndex": "0x3F7", + "MSRValue": "0x620006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 64 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", + "MSRIndex": "0x3F7", + "MSRValue": "0x604006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 8 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", + "MSRIndex": "0x3F7", + "MSRValue": "0x600806", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MISP_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "ANT retired branches that got just mispredic= ted", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts flows delivered by the Microcode Seque= ncer", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MS_FLOWS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced STLB (2n= d level TLB) true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.STLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x15", + "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that caused clears due t= o being Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", + "MSRIndex": "0x3F7", + "MSRValue": "0x17", + "PublicDescription": "Number retired branch instructions that caus= ed the front-end to be resteered when it finds the instruction in a fetch l= ine. This is called Unknown Branch which occurs for the first time a branch= instruction is fetched or when the branch is not tracked by the BPU (Branc= h Prediction Unit) anymore.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to Ins= truction L1 cache miss, that hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_HIT", + "PublicDescription": "Counts the number of instructions retired th= at were tagged because empty issue slots were seen before the uop due to In= struction L1 cache miss, that hit in the L2 cache. 
Includes L2 Hit resulting from an L1D eviction of another core in the same module, which is longer latency than a typical L2 hit.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that missed in the L2 cache.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0xc9",
+ "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_MISS",
+ "SampleAfterValue": "1000003",
+ "UMask": "0xe",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that hit in the L3 cache.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0xc9",
+ "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L3_HIT",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x6",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that hit in the second level TLB.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0xc9",
+ "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_HIT",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x10",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that also missed the second level TLB.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0xc9",
+ "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_MISS",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x20",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0x80",
+ "EventName": "ICACHE.ACCESSES",
+ "SampleAfterValue": "200003",
+ "UMask": "0x3",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0x80",
+ "EventName": "ICACHE.ACCESSES",
+ "SampleAfterValue": "200003",
+ "UMask": "0x3",
+ "Unit": "cpu_lowpower"
+ },
+ {
+ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump and the instruction cache registers bytes are present.",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0x80",
+ "EventName": "ICACHE.HIT",
+ "SampleAfterValue": "200003",
+ "UMask": "0x1",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump and the instruction cache registers bytes are not present. -",
+ "Counter": "0,1,2,3,4,5,6,7",
+ "EventCode": "0x80",
+ "EventName": "ICACHE.MISSES",
+ "SampleAfterValue": "200003",
+ "UMask": "0x2",
+ "Unit": "cpu_atom"
+ },
+ {
+ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump and the instruction cache registers bytes are not present. 
-", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.MISSES", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to L= 1 instruction cache miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALLS", + "PublicDescription": "Counts cycles where a code line fetch is sta= lled due to an L1 instruction cache miss. The decode pipeline works at a 32= Byte granularity.", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ICACHE_DATA.STALL_PERIODS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALL_PERIODS", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to L= 1 instruction cache tag miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x83", + "EventName": "ICACHE_TAG.STALLS", + "PublicDescription": "Counts cycles where a code fetch is stalled = due to L1 instruction cache tag miss.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Decode Stream Buffer (DSB) is deliveri= ng any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) p= ath.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles DSB is delivering optimal number of Uo= ps", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the D= SB (Decode Stream Buffer) path. Count includes uops that may 'bypass' the I= DQ.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from the Decode Stream Buffer (DSB) path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.DSB_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipe= line) path. During these cycles uops are not being delivered from the Decod= e Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering optimal number of U= ops", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the M= ITE (legacy decode pipeline) path. 
During these cycles uops are not being d= elivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from MITE path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MITE_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the MITE path. This also means that uops are= not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when uops are being delivered to IDQ w= hile MS is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_CYCLES_ANY", + "PublicDescription": "Counts cycles during which uops are being de= livered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS= ) is busy. Uops maybe initiated by Decode Stream Buffer (DSB) or MITE.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of switches from DSB or MITE to the MS= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_SWITCHES", + "PublicDescription": "Number of switches from DSB (Decode Stream B= uffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops initiated by MITE or Decode Stream Buffe= r (DSB) and delivered to Instruction Decode Queue (IDQ) while Microcode Seq= uencer (MS) is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MS_UOPS", + "PublicDescription": "Counts the number of uops initiated by MITE = or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (ID= Q) while the Microcode Sequencer (MS) is busy. Counting includes uops that = may 'bypass' the IDQ.", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that when no operation was delivered to the back-end pipeline due = to instruction fetch limitations when the back-end could have accepted more= operations. Common examples include instruction cache misses or x86 instru= ction decode limitations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CORE", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that when no operation was delivered to the back-end pipeline due= to instruction fetch limitations when the back-end could have accepted mor= e operations. Common examples include instruction cache misses or x86 instr= uction decode limitations. Software can use this event as the numerator for= the Frontend Bound metric (or top-level category) of the Top-down Microarc= hitecture Analysis method.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. 
[This event is alias to IDQ_BUBBLES.STARVATION_CYCLES]",
+ "Counter": "0,1,2,3,4,5,6,7,8,9",
+ "CounterMask": "8",
+ "Deprecated": "1",
+ "EventCode": "0x9c",
+ "EventName": "IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles when optimal number of uops was delivered to the back-end when the back-end is not stalled",
+ "Counter": "0,1,2,3,4,5,6,7,8,9",
+ "CounterMask": "1",
+ "EventCode": "0x9c",
+ "EventName": "IDQ_BUBBLES.CYCLES_FE_WAS_OK",
+ "Invert": "1",
+ "PublicDescription": "Counts the number of cycles when the optimal number of uops was delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x1",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles when no uops are delivered by the IDQ for 2 or more cycles when backend of the machine is not stalled - normally indicating a Fetch Latency issue",
+ "Counter": "0,1,2,3,4,5,6,7,8,9",
+ "EventCode": "0x9c",
+ "EventName": "IDQ_BUBBLES.FETCH_LATENCY",
+ "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls for 2 or more cycles - normally indicating a Fetch Latency issue.",
+ "SampleAfterValue": "1000003",
+ "UMask": "0x4",
+ "Unit": "cpu_core"
+ },
+ {
+ "BriefDescription": "Cycles when no uops are delivered by the IDQ when backend of the machine is not stalled [This event is alias to IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE]",
+ "Counter": "0,1,2,3,4,5,6,7,8,9",
+ "CounterMask": "8",
+ "EventCode": "0x9c",
+ "EventName": "IDQ_BUBBLES.STARVATION_CYCLES",
+ "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. 
[This event is alias to IDQ_BUBBL= ES.CYCLES_0_UOPS_DELIV.CORE]", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-se= quencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/memory.json b/tools/p= erf/pmu-events/arch/x86/arrowlake/memory.json new file mode 100644 index 000000000000..08f01fc66fef --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/memory.json @@ -0,0 +1,387 @@ +[ + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck, on a load that retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck, on a load that retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a core bound stall includin= g a store address match, a DTLB miss or a page walk that detains the load f= rom retiring.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_BOUND_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xf4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a core bound stall includin= g a store address match, a DTLB miss or a page walk that detains the load f= rom retiring.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_BOUND_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xf4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a DL1 miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DL1 = miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DL1 = miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to other = block cases.", + "Counter": "0,1,2,3,4,5,6,7", 
+ "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER_AT_RET", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer and retirement are both stalled due to other= block cases such as pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0xc0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to other = block cases.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER_AT_RET", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer and retirement are both stalled due to other= block cases such as pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0xc0", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to reques= t buffers full or lock in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to disambiguations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears due to memory orderi= ng conflicts.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "PublicDescription": "Counts the number of Machine Clears detected= dye to memory ordering. 
Memory Ordering Machine Clears may apply when a me= mory read may not conform to the memory ordering rules of the x86 architect= ure", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of machine clears due to me= mory ordering caused by a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to memory disambiguation.= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 1024 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", + "MSRIndex": "0x3F6", + "MSRValue": "0x400", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", + "SampleAfterValue": "53", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 128 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", + "SampleAfterValue": "1009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 16 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 2048 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", + "MSRIndex": "0x3F6", + "MSRValue": "0x800", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", + "SampleAfterValue": "23", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 256 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. 
Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "503",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.",
+        "Counter": "2,3,4,5,6,7,8,9",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
+        "MSRIndex": "0x3F6",
+        "MSRValue": "0x20",
+        "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "100007",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.",
+        "Counter": "2,3,4,5,6,7,8,9",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
+        "MSRIndex": "0x3F6",
+        "MSRValue": "0x4",
+        "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.",
+        "Counter": "2,3,4,5,6,7,8,9",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
+        "MSRIndex": "0x3F6",
+        "MSRValue": "0x200",
+        "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "101",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.",
+        "Counter": "2,3,4,5,6,7,8,9",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
+        "MSRIndex": "0x3F6",
+        "MSRValue": "0x40",
+        "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "2003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.",
+        "Counter": "2,3,4,5,6,7,8,9",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
+        "MSRIndex": "0x3F6",
+        "MSRValue": "0x8",
+        "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
+        "SampleAfterValue": "50021",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Retired memory store access operations. A PDist event for PEBS Store Latency Facility.",
+        "Counter": "0,1",
+        "Data_LA": "1",
+        "EventCode": "0xcd",
+        "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE",
+        "PublicDescription": "Counts retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all store uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts misaligned loads that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts misaligned loads that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts misaligned stores that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts misaligned stores that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0xFE7F8000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_RFO.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0xFE7F8000002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts demand data read requests that miss the L3 cache.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x21",
+        "EventName": "OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles where data return is pending for a Demand Data Read request that misses the L3 cache.",
+        "Counter": "0,1,2,3",
+        "CounterMask": "1",
+        "EventCode": "0x20",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_L3_MISS_DEMAND_DATA_RD",
+        "PublicDescription": "Cycles with at least 1 Demand Data Read request that misses the L3 cache in the superQ.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x20",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD",
+        "PublicDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache.
Note that this does not capture all elapsed cycles while requests are = outstanding - only cycles from when the requests were known by the requesti= ng core to have missed the L3 cache.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json b/t= ools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json new file mode 100644 index 000000000000..855585fe6fae --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json @@ -0,0 +1,150 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Met= rics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "BvBC": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvBO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMT": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvOB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvUW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "CodeGen": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down 
Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Ifetch": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Mem_Exec": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + 
"TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "load_store_bound": "Grouping from Top-down Microarchitecture Analysis= Metrics spreadsheet", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwi= dth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency = category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", + "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", + "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", + "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + "tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the 
issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcod= e_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_= ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_util= ization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utili= zed_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_util= ized_3m category", + "tma_resource_bound_group": "Metrics contributing to tma_resource_boun= d category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" +} diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/other.json b/tools/pe= rf/pmu-events/arch/x86/arrowlake/other.json new file mode 100644 index 000000000000..0175b2193201 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/other.json @@ -0,0 +1,279 @@ +[ + { + "BriefDescription": "Count all other hardware assists or traps tha= t are not necessarily architecturally exposed (through a software handler) = beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-eve= nts.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.HARDWARE", + "PublicDescription": "Count all other hardware assists or traps th= at are not necessarily architecturally exposed (through a software handler)= beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-ev= ents. 
This includes, but is not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist), assists generated by the ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "ASSISTS.PAGE_FAULT",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.PAGE_FAULT",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa2",
+        "EventName": "BE_STALLS.SCOREBOARD",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of times a load depends on another load that had just written back its data, in the previous cycle or 2 cycles back. This event supports indirect dependency through a single uop.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x02",
+        "EventName": "DEPENDENT_LOADS.ANY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on secondary integer ports 0,1,2,3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.2ND",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x80",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on a load port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.LD",
+        "PublicDescription": "Counts the number of uops executed on a load port. This event counts integer uops even if the destination is FP/vector.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer ports 0, 1, 2, 3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.PRIMARY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x78",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on a store address port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.STA",
+        "PublicDescription": "Counts the number of uops executed on a store address port. This event counts integer uops even if the data source is FP/vector.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on an integer store data and jump port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.STD_JMP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This event is deprecated.
[This event is alia= s to MISC_RETIRED.LBR_INSERTS]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xe4", + "EventName": "LBR_INSERTS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L1", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L2", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L3", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.MEM", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1E780000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores which modify only par= t of a 64 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any 
type of= response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any type of= response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots where no uop= could issue due to an IQ scoreboard that stalls allocation until a specifi= ed older uop retires or (in the case of jump scoreboard) executes. 
Commonly= executed instructions with IQ scoreboards include LFENCE and MFENCE.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.IQ_JEU_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles the uncore cannot take further request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x2d", + "EventName": "XQ.FULL", + "PublicDescription": "number of cycles when the thread is active a= nd the uncore cannot take any further requests (for example prefetches, loa= ds or stores initiated by the Core that miss the L2 cache).", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json b/tools= /perf/pmu-events/arch/x86/arrowlake/pipeline.json new file mode 100644 index 000000000000..35f0932330b7 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json @@ -0,0 +1,2298 @@ +[ + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when divide unit is busy executing div= ide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.DIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer and floating-po= int operations.", + "SampleAfterValue": "1000003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles when integer divide unit is busy execu= ting divide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.IDIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer operations only= .", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.ANY", + "PublicDescription": "Counts the number of occurrences where a mic= rocode assist is invoked by hardware. Examples include AD (page Access Dirt= y), FP and AVX related assists.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. 
All branch type instructions are accounted for.",
+        "SampleAfterValue": "200003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "All branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts all branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the total number of branch instructions retired for all branch types.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts the total number of instructions in which the instruction pointer (IP) of the processor is resteered due to a branch instruction and the branch instruction successfully retires. All branch type instructions are accounted for.",
+        "SampleAfterValue": "200003",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts JCC (Jump on Conditional Code) branch instructions retired; includes both taken and not taken branches.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "PublicDescription": "Counts conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x111",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of JCC (Jump on Conditional Code) branch instructions retired; includes both taken and not taken branches.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7e",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Not taken branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+        "PublicDescription": "Counts not taken branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of taken JCC branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Taken conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "PublicDescription": "Counts taken conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x101",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of taken JCC (Jump on Conditional Code) branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfe",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Taken backward conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN_BWD",
+        "PublicDescription": "Counts taken backward conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken forward
conditional branch instructions= retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN_FWD", + "PublicDescription": "Counts taken forward conditional branch inst= ructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x102", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of far branch instructions = retired, includes far jump, far call and return, and Interrupt call and ret= urn", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "SampleAfterValue": "200003", + "UMask": "0xbf", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Far branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "PublicDescription": "Counts far branch instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of far branch instructions = retired, includes far jump, far call and return, and interrupt call and ret= urn.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "SampleAfterValue": "200003", + "UMask": "0xbf", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect JMP and ne= ar indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Indirect near branch instructions retired (ex= cluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. 
TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near indirect JMP and ne= ar indirect CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near indirect CALL branc= h instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near CALL branch instruc= tions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "SampleAfterValue": "200003", + "UMask": "0xf9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Direct and indirect near call instructions re= tired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "PublicDescription": "Counts both direct and indirect near call in= structions retired.", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near CALL branch instruc= tions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "SampleAfterValue": "200003", + "UMask": "0xf9", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near RET branch instruct= ions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Return instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "PublicDescription": "Counts return instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near RET branch instruct= ions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Taken branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts taken branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near taken branch instru= ctions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xc0", + "Unit": 
"cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near relative JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_JMP", + "SampleAfterValue": "200003", + "UMask": "0xdf", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. 
A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "200003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= This precise event may be used to get the misprediction cost via the Retir= e_Latency field of PEBS. It fires on the instruction that immediately follo= ws the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST", + "SampleAfterValue": "400009", + "UMask": "0x44", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted JCC branch = instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", + "SampleAfterValue": "400009", + "UMask": "0x111", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted JCC (Jump o= n Conditional Code) branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired. This precise event may be used to get the misprediction cost via t= he Retire_Latency field of PEBS. It fires on the instruction that immediate= ly follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_COST", + "SampleAfterValue": "400009", + "UMask": "0x151", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN", + "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", + "SampleAfterValue": "400009", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired. This precise event may be used to get the misprediction = cost via the Retire_Latency field of PEBS. 
It fires on the instruction that= immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted taken JCC b= ranch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x101", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted taken JCC (= Jump on Conditional Code) branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD", + "PublicDescription": "Counts taken backward conditional mispredict= ed branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward. This precise event may be used to get t= he misprediction cost via the Retire_Latency field of PEBS. It fires on the= instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8001", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted taken conditional branch instruc= tions retired. This precise event may be used to get the misprediction cost= via the Retire_Latency field of PEBS. It fires on the instruction that imm= ediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x141", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD", + "PublicDescription": "Counts taken forward conditional mispredicte= d branch instructions retired.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward. This precise event may be used to get th= e misprediction cost via the Retire_Latency field of PEBS. 
It fires on the = instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8002", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP and near indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Miss-predicted near indirect branch instructi= ons retired (excluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP and near indirect CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", + "SampleAfterValue": "400009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired. This prec= ise event may be used to get the misprediction cost via the Retire_Latency = field of PEBS. It fires on the instruction that immediately follows the mis= predicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", + "SampleAfterValue": "400009", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted near indirect branch instruction= s retired (excluding returns). This precise event may be used to get the mi= sprediction cost via the Retire_Latency field of PEBS. 
It fires on the inst= ruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_COST", + "SampleAfterValue": "100003", + "UMask": "0xc0", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of near branch instructions retired th= at were mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near taken = branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted taken near branch instructions r= etired. This precise event may be used to get the misprediction cost via th= e Retire_Latency field of PEBS. It fires on the instruction that immediatel= y follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x60", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts the number of mispredicted = ret instructions retired. Non PEBS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET", + "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near RET br= anch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of mispredicted near RET br= anch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted ret instructions retired. This p= recise event may be used to get the misprediction cost via the Retire_Laten= cy field of PEBS. It fires on the instruction that immediately follows the = mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET_COST", + "SampleAfterValue": "100007", + "UMask": "0x48", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.1 li= ght-weight slower wakeup time but more power saving optimized state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C01", + "PublicDescription": "Counts core clocks when the thread is in the= C0.1 light-weight slower wakeup time but more power saving optimized state= . 
This state can be entered via the TPAUSE or UMWAIT instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.2 li= ght-weight faster wakeup time but less power saving optimized state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C02", + "PublicDescription": "Counts core clocks when the thread is in the= C0.2 light-weight faster wakeup time but less power saving optimized state= . This state can be entered via the TPAUSE or UMWAIT instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.1 or= C0.2 or running a PAUSE in C0 ACPI state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C0_WAIT", + "PublicDescription": "Counts core clocks when the thread is in the= C0.1 or C0.2 power saving optimized states (TPAUSE or UMWAIT instructions)= or running the PAUSE instruction.", + "SampleAfterValue": "2000003", + "UMask": "0x70", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core cycles when the core is not in a halt st= ate.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "PublicDescription": "Counts the number of core cycles while the c= ore is not in a halt state. The core enters the halt state when it is runni= ng the HLT instruction. This event is a component in many key event ratios.= The core frequency may change from time to time due to transitions associa= ted with Enhanced Intel SpeedStep Technology or TM2. For this reason this e= vent may have a changing ratio with regards to time. When the core frequenc= y is constant, this event can approximate elapsed time while the core was n= ot in the halt state. It is counted on a dedicated fixed counter, leaving t= he programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. 
[This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Core clocks when a PAUSE is pending.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.PAUSE", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Pause instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.PAUSE_INST", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted reference clock cycles.", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "SampleAfterValue": "2000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt state.", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states' duty-off periods, when the processor is 'halted'. The counter update is done at a lower clock rate than the core clock, so the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX, the reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high' (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' (bit 34) while the counter value is less than MAX. Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted reference clock cycles", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "SampleAfterValue": "2000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of unhalted reference clock cycles", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is not affected by core frequency changes and increments at a fixed frequency that is also used for the Time Stamp Counter (TSC).
This event uses a programmable general purpose performance counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states' duty-off periods, when the processor is 'halted'. The counter update is done at a lower clock rate than the core clock, so the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX, the reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high' (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' (bit 34) while the counter value is less than MAX. Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted reference clock cycles at TSC frequency.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is not affected by core frequency changes and increments at a fixed frequency that is also used for the Time Stamp Counter (TSC). This event uses a programmable general purpose performance counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core cycles when the thread is not in a halt state.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "PublicDescription": "Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state.
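(Aside, not part of the diff: because CPU_CLK_UNHALTED.THREAD ticks at core frequency while CPU_CLK_UNHALTED.REF_TSC ticks at the fixed TSC rate, their ratio approximates the average frequency multiplier while unhalted. A sketch using the kernel's generic hardware events, assuming the usual mapping of PERF_COUNT_HW_CPU_CYCLES and PERF_COUNT_HW_REF_CPU_CYCLES onto these counters; error handling omitted.)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_hw(uint64_t config, int group_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = config;
	return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
	uint64_t cyc = 0, ref = 0;
	int c = open_hw(PERF_COUNT_HW_CPU_CYCLES, -1);     /* group leader */
	int r = open_hw(PERF_COUNT_HW_REF_CPU_CYCLES, c);

	/* ... run the workload to be measured here ... */

	read(c, &cyc, sizeof(cyc));
	read(r, &ref, sizeof(ref));
	printf("unhalted core/ref clock ratio: %.2f\n",
	       ref ? (double)cyc / (double)ref : 0.0);
	return 0;
}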
It is counted on a dedicated fixed counter, leavi= ng the programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles while memory subsystem has an outstand= ing load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "16", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.CYCLES_MEM_ANY", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total execution stalls.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "4", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.STALLS_TOTAL", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.1_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 1 uop = was executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 or 3 uops are executed on a= ll ports and Reservation Station (RS) was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_3_PORTS_UTIL", + "SampleAfterValue": "2000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 2 uops= were executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 3 uops are executed on all 
po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.3_PORTS_UTIL", + "PublicDescription": "Cycles total of 3 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 4 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.4_PORTS_UTIL", + "PublicDescription": "Cycles total of 4 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Execution stalls while memory subsystem has a= n outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "5", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_LOADS", + "SampleAfterValue": "2000003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where the Store Buffer was full and no= loads caused an execution stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "2", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_STORES", + "PublicDescription": "Counts cycles where the Store Buffer was ful= l and no loads caused an execution stall.", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles no uop executed while RS was not empty= , the SB was not full and there was no outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.EXE_BOUND_0_PORTS", + "PublicDescription": "Number of cycles total of 0 uops executed on= all ports, Reservation Station (RS) was not empty, the Store Buffer (SB) w= as not full and there was no outstanding load.", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction decoders utilized in a cycle", + "Counter": "2", + "EventCode": "0x75", + "EventName": "INST_DECODED.DECODERS", + "PublicDescription": "Number of decoders utilized in a cycle when = the MITE (legacy decode pipeline) fetches instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired.", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. General Count= er - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "retired macro-fused uops when there is a bran= ch in the macro-fused pair (the two instructions that got macro-fused count= once in this pmon)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.BR_FUSED", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INST_RETIRED.MACRO_FUSED", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.MACRO_FUSED", + "SampleAfterValue": "2000003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired NOP instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.NOP", + "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREF= ETCHIT0/1 instructions", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.PREC_DIST", + "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Iterations of Repeat string retired instructi= ons.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. 
Note the number of ite= rations is implementation-dependent.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BPClear.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.BPCLEAR_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0xB", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Clears speculative count", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due = to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles after recovery from a branch mi= sprediction or machine clear till the first uop is issued from the resteere= d path.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEAR_RESTEER_CYCLES", + "PublicDescription": "Cycles after recovery from a branch mispredi= ction or machine clear till the first uop is issued from the resteered path= .", + "SampleAfterValue": "500009", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core cycles the allocator was stalled due to = recovery from earlier clear event for this thread", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.RECOVERY_CYCLES", + "PublicDescription": "Counts core cycles when the Resource allocat= or was stalled due to recovery from an earlier branch misprediction or mach= ine clear event.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BAClear (Unknown Branch).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UNKNOWN_BRANCH_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0x7", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots where uops got dropped", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UOP_DROPPING", + "PublicDescription": "Estimated number of Top-down Microarchitectu= re Analysis slots that got dropped due to non front-end reasons", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 128-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.128BIT", + "SampleAfterValue": "1000003", + "UMask": "0x13", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 256-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.256BIT", + "SampleAfterValue": "1000003", + "UMask": "0xac", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 128-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.ADD_128", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 128-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 256-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": 
"INT_VEC_RETIRED.ADD_256", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 256-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.MUL_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.MUL_256", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.SHUFFLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.SHUFFLES", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_128", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_128", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_256", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "False dependencies in MOB due to partial comp= are on address.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "PublicDescription": "Counts the number of times a load got blocke= d due to false dependencies in MOB due to partial compare on address.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address exactly matches an older store whose da= ta is not ready (a.k.a. unknown). 
unready_fwd", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because its address exactly matches an older store whose data is not= ready.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "The number of times that split load operation= s are temporarily blocked because all resources for handling the split acce= sses are in use.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.NO_SR", + "PublicDescription": "Counts the number of times that split load o= perations are temporarily blocked because all resources for handling the sp= lit accesses are in use.", + "SampleAfterValue": "100003", + "UMask": "0x88", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. 
Note: See the table of not supported store forwa= rds in the Optimization Guide.", + "SampleAfterValue": "100003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because its address partially overlapped with an older store.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles Uops delivered by the LSD, but didn't = come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_ACTIVE", + "PublicDescription": "Counts the cycles when at least one uop is d= elivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles optimal number of Uops delivered by th= e LSD, but did not come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_OK", + "PublicDescription": "Counts the cycles when optimal number of uop= s is delivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops delivered by the LSD.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa8", + "EventName": "LSD.UOPS", + "PublicDescription": "Counts the number of uops delivered to the b= ack-end by the LSD(Loop Stream Detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts all machine clears for any reason incl= uding, but not limited to memory ordering, SMC, and FP assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY", + "SampleAfterValue": "20003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears (nukes) of any type.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.COUNT", + "PublicDescription": "Counts the number of machine clears (nukes) = of any type.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to an internal load passing an older store within the = same CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears due to me= mory ordering in which an internal load passes an older store within the sa= me CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of nukes due to memory rena= ming", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE", + "SampleAfterValue": "20003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times that the machine c= lears due to a page fault. Covers both I-Side and D-Side (Loads/Stores) pa= ge faults. 
A page fault occurs when either the page is not present, or an access violation occurs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.PAGE_FAULT", + "SampleAfterValue": "20003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears due to a page fault. Counts both I-Side and D-Side (Loads/Stores) page faults. A page fault occurs when either the page is not present, or an access violation occurs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.PAGE_FAULT", + "SampleAfterValue": "20003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SLOW", + "SampleAfterValue": "20003", + "UMask": "0x6e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SLOW", + "SampleAfterValue": "20003", + "UMask": "0x6f", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Self-modifying code (SMC) detected.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SMC", + "PublicDescription": "Counts self-modifying code (SMC) detected, which causes a machine clear.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of machine clears due to program modifying data (self modifying code) within 1K of a recently fetched code page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SMC", + "SampleAfterValue": "20003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "LFENCE instructions retired", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe0", + "EventName": "MISC2_RETIRED.LFENCE", + "PublicDescription": "Number of retired LFENCE instructions.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "LBR record is inserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of Last Branch Record (LBR) entries. Requires LBRs to be enabled and configured in IA32_LBR_CTL.
[This= event is alias to LBR_INSERTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the = front-end from issuing from the UROM until a specified older uop retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.NON_C01_MS_SCB", + "PublicDescription": "Counts the number of issue slots not consume= d by the backend due to a micro-sequencer (MS) scoreboard, which stalls the= front-end from issuing from the UROM until a specified older uop retires. = The most commonly executed instruction with an MS scoreboard is PAUSE.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were not consumed by the back-end pipeline due to lack of ba= ck-end resources, as a result of memory subsystem delays, execution units l= imitations, or other conditions. Software can use this event as the numerat= or for the Backend Bound metric (or top-level category) of the Top-down Mic= roarchitecture Analysis method.", + "SampleAfterValue": "10000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= s.", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BAD_SPEC_SLOTS", + "PublicDescription": "Number of slots of TMA method that were wast= ed due to incorrect speculation. It covers all types of control-flow or dat= a-related mis-speculations.", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= by branch mispredictions", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS", + "PublicDescription": "Number of TMA slots that were wasted due to = incorrect speculation by (any type of) branch mispredictions. This event es= timates number of speculative operations that were issued but not retired a= s well as the out-of-order engine recovery past a branch misprediction.", + "SampleAfterValue": "10000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TOPDOWN.MEMORY_BOUND_SLOTS", + "Counter": "3", + "EventCode": "0xa4", + "EventName": "TOPDOWN.MEMORY_BOUND_SLOTS", + "SampleAfterValue": "10000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. Fixed counter - architectural event", + "Counter": "Fixed counter 3", + "EventName": "TOPDOWN.SLOTS", + "PublicDescription": "Number of available slots for an unhalted lo= gical processor. The event increments by machine-width of the narrowest pip= eline as employed by the Top-down Microarchitecture Analysis method (TMA). = Software can use this event as the denominator for the top-level metrics of= the TMA method. 
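(Aside, not part of the diff: the top-level TMA fractions are plain ratios over TOPDOWN.SLOTS; below is a sketch of the arithmetic for the two slot events defined above. Retiring and Frontend Bound come from further events not repeated here.)

#include <stdint.h>

struct tma_level1 {
	double backend_bound;   /* TOPDOWN.BACKEND_BOUND_SLOTS / TOPDOWN.SLOTS */
	double bad_speculation; /* TOPDOWN.BAD_SPEC_SLOTS / TOPDOWN.SLOTS */
};

static struct tma_level1 tma_level1_fractions(uint64_t slots,
					      uint64_t backend_bound_slots,
					      uint64_t bad_spec_slots)
{
	struct tma_level1 t = { 0.0, 0.0 };

	if (slots) {
		t.backend_bound = (double)backend_bound_slots / slots;
		t.bad_speculation = (double)bad_spec_slots / slots;
	}
	return t;
}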
This architectural event is counted on a designated fixed = counter (Fixed Counter 3).", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. General counter - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.SLOTS_P", + "PublicDescription": "Counts the number of available slots for an = unhalted logical processor. The event increments by machine-width of the na= rrowest pipeline as employed by the Top-down Microarchitecture Analysis met= hod.", + "SampleAfterValue": "10000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", + "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window, including = relevant microcode flows, and while uops are not yet available in the instr= uction queue (IQ) or until an FE_BOUND event occurs besides OTHER and CISC.= Also includes the issue slots that were consumed by the backend but were t= hrown away because they were younger than the mispredict or machine clear.", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Fast Nukes such as Memory Ord= ering Machine clears and MRN nukes", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.FASTNUKE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Fast Nukes such as Memory Ord= ering Machine clears and MRN nukes", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.FASTNUKE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": 
"TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to memory reservation stall (sche= duler not being able to accept another uop). 
This could be caused by RSV full or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle that were not consumed by the backend due to memory reservation stall (scheduler not being able to accept another uop). This could be caused by RSV full or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle that were not consumed by the backend due to IEC and FPC RAT stalls - which can be due to the FIQ and IEC reservation station stall (integer, FP, and SIMD schedulers not being able to accept another uop).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle that were not consumed by the backend due to IEC and FPC RAT stalls - which can be due to the FIQ and IEC reservation station stall (integer, FP, and SIMD schedulers not being able to accept another uop).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle that were not consumed by the backend due to mrbl stall. A 'marble' refers to a physical register file entry, also known as the physical destination (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle that were not consumed by the backend due to mrbl stall.
A 'marble' refers= to a physical register file entry, also known as the physical destination = (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to ROB full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REORDER_BUFFER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls.", + "Counter": "37", + "EventName": "TOPDOWN_FE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x9c", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + 
"SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. 
[This event is alia= s to TOPDOWN_FE_BOUND.ITLB_MISS]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss [This event is ali= as to TOPDOWN_FE_BOUND.ITLB]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", + "Counter": "38", + "EventName": "TOPDOWN_RETIRING.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x72", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of non dec-by-all uops decoded by deco= der", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x76", + "EventName": "UOPS_DECODED.DEC0_UOPS", + "PublicDescription": "This event counts the number of not dec-by-a= ll uops decoded by decoder 0.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on INT EU ALU ports.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.ALU", + "PublicDescription": "Number of ALU 
integer uops dispatched to execution.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on any INT EU ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.INT_EU_ALL", + "PublicDescription": "Number of integer uops dispatched to execution.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched/executed by any of the 3 JEUs (all uops that hold the JEU including macro; micro jumps; fetch-from-eip)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.JMP", + "PublicDescription": "Number of jump uops dispatched to execution", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on Load ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.LOAD", + "PublicDescription": "Number of Load uops dispatched to execution.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of (shift) 1-cycle Uops dispatched/executed by any of the Shift Eus", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SHIFT", + "PublicDescription": "Number of SHIFT integer uops dispatched to execution", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched/executed by Slow EU (e.g. 3+ cycles LEA, >1 cycles shift, iDIVs, CR; *H operation)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SLOW", + "PublicDescription": "Number of Slow integer uops dispatched to execution.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched on STA ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STA", + "PublicDescription": "Number of STA (Store Address) uops dispatched to execution", + "SampleAfterValue": "2000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on STD ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STD", + "PublicDescription": "Number of STD (Store Data) uops dispatched to execution", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 uop was executed per-thread", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_1", + "PublicDescription": "Cycles where at least 1 uop was executed per-thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 2 uops were executed per-thread", + "Counter": "3", + "CounterMask": "2", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_2", + "PublicDescription": "Cycles where at least 2 uops were executed per-thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 3 uops were executed per-thread", + "Counter": "3", + "CounterMask": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_3", + "PublicDescription": "Cycles where at least 3 uops were executed per-thread.", + "SampleAfterValue": "2000003", +
"UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 4 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "4", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_4", + "PublicDescription": "Cycles where at least 4 uops were executed p= er-thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of cycles no uops were dispatch= ed to be executed on this thread.", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.STALLS", + "Invert": "1", + "PublicDescription": "Counts cycles during which no uops were disp= atched from the Reservation Station (RS) per thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops to be executed per-= thread each cycle.", + "Counter": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.THREAD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of x87 uops dispatched.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.X87", + "PublicDescription": "Counts the number of x87 uops executed.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "When 4-uops are requested and only 2-uops are= delivered, the event counts 2. Uops_issued correlates to the number of RO= B entries. If uop takes 2 ROB slots it counts as 2 uops_issued.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops that RAT issues to RS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops that the Resource = Allocation Table (RAT) issues to the Reservation Station (RS).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops issued by the front= end every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops issued by the fron= t end every cycle. When 4-uops are requested and only 2-uops are delivered,= the event counts 2. Uops_issued correlates to the number of ROB entries. 
= If uop takes 2 ROB slots it counts as 2 uops_issued.", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "UOPS_ISSUED.CYCLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.CYCLES", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.ALL", + "SampleAfterValue": "2000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles with retired uop(s).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.CYCLES", + "PublicDescription": "Counts cycles where at least one uop has ret= ired.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired uops except the last uop of each inst= ruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.HEAVY", + "PublicDescription": "Counts the number of retired micro-operation= s (uops) except the last uop of each instruction. An instruction that is de= coded into less than two uops does not contribute to the count.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of integer divide uops reti= red", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of integer divide uops reti= red.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "UOPS_RETIRED.MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of non-speculative switches to the Mic= rocode Sequencer (MS)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS_SWITCHES", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "PublicDescription": "Switches to the Microcode Sequencer", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that are utilized by operations that eventually get retired (commi= tted) by the processor pipeline. 
Usually, this event positively correlates = with higher performance for example, as measured by the instructions-per-c= ycle metric.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that are utilized by operations that eventually get retired (comm= itted) by the processor pipeline. Usually, this event positively correlates= with higher performance for example, as measured by the instructions-per-= cycle metric. Software can use this event as the numerator for the Retiring= metric (or top-level category) of the Top-down Microarchitecture Analysis = method.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles without actually retired uops.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.STALLS", + "Invert": "1", + "PublicDescription": "This event counts cycles without actually re= tired uops.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of x87 uops retired, includ= es those in ms flows", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.X87", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of x87 uops retired, includ= es those in ms flows", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.X87", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json b/t= ools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json new file mode 100644 index 000000000000..f294852dfbe6 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json @@ -0,0 +1,20 @@ +[ + { + "BriefDescription": "Number of all entries allocated. Includes als= o retries.", + "Counter": "0,1", + "EventCode": "0x35", + "EventName": "UNC_HAC_CBO_TOR_ALLOCATION.ALL", + "PerPkg": "1", + "UMask": "0x8", + "Unit": "HAC_CBO" + }, + { + "BriefDescription": "Asserted on coherent DRD + DRdPref allocatio= ns into the queue. Cacheable only", + "Counter": "0,1", + "EventCode": "0x35", + "EventName": "UNC_HAC_CBO_TOR_ALLOCATION.DRD", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "HAC_CBO" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.j= son b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json new file mode 100644 index 000000000000..15971f0c71db --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json @@ -0,0 +1,47 @@ +[ + { + "BriefDescription": "Number of all coherent Data Read entries. 
Doe= sn't include prefetches", + "Counter": "0,1", + "EventCode": "0x81", + "EventName": "UNC_HAC_ARB_REQ_TRK_REQUEST.DRD", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "HAC_ARB" + }, + { + "BriefDescription": "Number of all CMI transactions", + "Counter": "0,1", + "EventCode": "0x8A", + "EventName": "UNC_HAC_ARB_TRANSACTIONS.ALL", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "HAC_ARB" + }, + { + "BriefDescription": "Number of all CMI reads", + "Counter": "0,1", + "EventCode": "0x8A", + "EventName": "UNC_HAC_ARB_TRANSACTIONS.READS", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "HAC_ARB" + }, + { + "BriefDescription": "Number of all CMI writes not including Mflush= ", + "Counter": "0,1", + "EventCode": "0x8A", + "EventName": "UNC_HAC_ARB_TRANSACTIONS.WRITES", + "PerPkg": "1", + "UMask": "0x4", + "Unit": "HAC_ARB" + }, + { + "BriefDescription": "Total number of all outgoing entries allocate= d. Accounts for Coherent and non-coherent traffic.", + "Counter": "0,1", + "EventCode": "0x81", + "EventName": "UNC_HAC_ARB_TRK_REQUESTS.ALL", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "HAC_ARB" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json b/= tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json new file mode 100644 index 000000000000..783a4f7fd05b --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json @@ -0,0 +1,142 @@ +[ + { + "BriefDescription": "Counts every CAS read command sent from the M= emory Controller 0 to DRAM (sum of all channels).", + "Counter": "0", + "EventCode": "0xff", + "EventName": "UNC_MC0_RDCAS_COUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every CAS read command sent from the = Memory Controller 0 to DRAM (sum of all channels). Each CAS commands can be= for 32B or 64B of data.", + "UMask": "0x20", + "Unit": "imc_free_running_0" + }, + { + "BriefDescription": "Counts every read and write request entering = the Memory Controller 0.", + "Counter": "2", + "EventCode": "0xff", + "EventName": "UNC_MC0_TOTAL_REQCOUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every read and write request entering= the Memory Controller 0 (sum of all channels). All requests are counted as= one, whether they are 32B or 64B Read/Write or partial/full line writes. S= ome write requests to the same address may merge to a single write command = to DRAM. Therefore, the total request count may be higher than total DRAM B= W.", + "UMask": "0x10", + "Unit": "imc_free_running_0" + }, + { + "BriefDescription": "Counts every CAS write command sent from the = Memory Controller 0 to DRAM (sum of all channels).", + "Counter": "1", + "EventCode": "0xff", + "EventName": "UNC_MC0_WRCAS_COUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every CAS write command sent from the= Memory Controller 0 to DRAM (sum of all channels). Each CAS commands can = be for 32B or 64B of data.", + "UMask": "0x30", + "Unit": "imc_free_running_0" + }, + { + "BriefDescription": "Counts every CAS read command sent from the M= emory Controller 1 to DRAM (sum of all channels).", + "Counter": "3", + "EventCode": "0xff", + "EventName": "UNC_MC1_RDCAS_COUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every CAS read command sent from the = Memory Controller 1 to DRAM (sum of all channels). 
Each CAS commands can be= for 32B or 64B of data.", + "UMask": "0x20", + "Unit": "imc_free_running_1" + }, + { + "BriefDescription": "Counts every read and write request entering = the Memory Controller 1.", + "Counter": "5", + "EventCode": "0xff", + "EventName": "UNC_MC1_TOTAL_REQCOUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every read and write request entering= the Memory Controller 1 (sum of all channels). All requests are counted as= one, whether they are 32B or 64B Read/Write or partial/full line writes. S= ome write requests to the same address may merge to a single write command = to DRAM. Therefore, the total request count may be higher than total DRAM B= W.", + "UMask": "0x10", + "Unit": "imc_free_running_1" + }, + { + "BriefDescription": "Counts every CAS write command sent from the = Memory Controller 1 to DRAM (sum of all channels).", + "Counter": "4", + "EventCode": "0xff", + "EventName": "UNC_MC1_WRCAS_COUNT_FREERUN", + "PerPkg": "1", + "PublicDescription": "Counts every CAS write command sent from the= Memory Controller 1 to DRAM (sum of all channels). Each CAS commands can = be for 32B or 64B of data.", + "UMask": "0x30", + "Unit": "imc_free_running_1" + }, + { + "BriefDescription": "ACT command for a read request sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x24", + "EventName": "UNC_M_ACT_COUNT_RD", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "ACT command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x26", + "EventName": "UNC_M_ACT_COUNT_TOTAL", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "ACT command for a write request sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x25", + "EventName": "UNC_M_ACT_COUNT_WR", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Read CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x22", + "EventName": "UNC_M_CAS_COUNT_RD", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Write CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x23", + "EventName": "UNC_M_CAS_COUNT_WR", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "PRE command sent to DRAM due to page table id= le timer expiration", + "Counter": "0,1,2,3,4", + "EventCode": "0x28", + "EventName": "UNC_M_PRE_COUNT_IDLE", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "PRE command sent to DRAM for a read/write req= uest", + "Counter": "0,1,2,3,4", + "EventCode": "0x27", + "EventName": "UNC_M_PRE_COUNT_PAGE_MISS", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Number of bytes read from DRAM, in 32B chunks= . Counter increments by 1 after receiving 32B chunk data.", + "Counter": "0,1,2,3,4", + "EventCode": "0x3A", + "EventName": "UNC_M_RD_DATA", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Total number of read and write byte transfers= to/from DRAM, in 32B chunks. Counter increments by 1 after sending or rece= iving 32B chunk data.", + "Counter": "0,1,2,3,4", + "EventCode": "0x3C", + "EventName": "UNC_M_TOTAL_DATA", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Number of bytes written to DRAM, in 32B chunk= s. 
Counter increments by 1 after sending 32B chunk data.", + "Counter": "0,1,2,3,4", + "EventCode": "0x3B", + "EventName": "UNC_M_WR_DATA", + "PerPkg": "1", + "Unit": "iMC" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json b/t= ools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json new file mode 100644 index 000000000000..1ac5b5ef8094 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json @@ -0,0 +1,10 @@ +[ + { + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", + "Counter": "FIXED", + "EventCode": "0xff", + "EventName": "UNC_CLOCK.SOCKET", + "PerPkg": "1", + "Unit": "CLOCK" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json new file mode 100644 index 000000000000..a3e4a4f3ab45 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json @@ -0,0 +1,522 @@ +[ + { + "BriefDescription": "Counts the number of page walks initiated by = a demand load that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccounts for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads that miss the DTLB and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "PublicDescription": "Counts loads that miss the DTLB (Data TLB) a= nd hit the STLB (Second level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x320", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccounts for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for a demand load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a demand load.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Load miss in all TLB levels causes a page wal= k that completes. (All page sizes)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts completed page walks (all page sizes= ) caused by demand data loads. This implies it missed in the DTLB and furth= er levels of TLB. 
The page walk can end with or without a fault.", + "SampleAfterValue": "100003", + "UMask": "0xe", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED", + "SampleAfterValue": "200003", + "UMask": "0xe", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pa= ges. Includes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M sizes) c= aused by demand data loads. This implies address translations missed in the= DTLB and further levels of TLB. The page walk can end with or without a fa= ult.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pa= ges. Includes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. I= ncludes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x2",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks completed due to load DTLB misses to a 4K page.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x08",
+    "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
+    "PublicDescription": "Counts the number of page walks completed due to loads (including SW prefetches) whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x2",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x08",
+    "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Number of page walks outstanding for a demand load in the PMH each cycle.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x12",
+    "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for a demand load in the PMH (Page Miss Handler) each cycle.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x10",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x08",
+    "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks initiated by a store that missed the first and second level TLBs.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x49",
+    "EventName": "DTLB_STORE_MISSES.MISS_CAUSED_WALK",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x1",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Counts the number of first level TLB misses but second level hits due to stores that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x49",
+    "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+    "PublicDescription": "Counts the number of first level TLB misses but second level hits due to stores that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x20",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Stores that miss the DTLB and hit the STLB.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+    "PublicDescription": "Counts stores that miss the DTLB (Data TLB) and hit the STLB (2nd Level TLB).",
+    "SampleAfterValue": "100003",
+    "UMask": "0x320",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of first level TLB misses but second level hits due to stores that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x49",
+    "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x20",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Cycles when at least one PMH is busy with a page walk for a store.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "CounterMask": "1",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE",
+    "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a store.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x10",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Store misses in all TLB levels cause a page walk that completes. (All page sizes)",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
+    "PublicDescription": "Counts completed page walks (all page sizes) caused by demand data stores. This implies it missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+    "SampleAfterValue": "100003",
+    "UMask": "0xe",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks completed due to store DTLB misses.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x49",
+    "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
+    "SampleAfterValue": "2000003",
+    "UMask": "0xe",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Page walks completed due to a demand data store to a 1G page.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G",
+    "PublicDescription": "Counts completed page walks (1G sizes) caused by demand data stores. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x8",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Page walks completed due to a demand data store to a 2M/4M page.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M",
+    "PublicDescription": "Counts completed page walks (2M/4M sizes) caused by demand data stores. This implies address translations missed in the DTLB and further levels of TLB.
The page walk can end with or without a f= ault.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks= that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that = page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that = page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks outstanding i= n the page miss handler (PMH) for stores every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = in the page miss handler (PMH) for stores every cycle. A PMH page walk is o= utstanding from page walk start till PMH becomes idle again (ready to serve= next walk). 
Includes EPT-walk intervals.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Number of page walks outstanding for a store in the PMH each cycle.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x13",
+    "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for a store in the PMH (Page Miss Handler) each cycle.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x10",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x49",
+    "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks initiated by an instruction fetch that missed the first and second level TLBs.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x85",
+    "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x1",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks initiated by an instruction fetch that missed the first and second level TLBs.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x85",
+    "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+    "SampleAfterValue": "1000003",
+    "UMask": "0x1",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Counts the number of first level TLB misses but second level hits due to an instruction fetch that did not start a page walk. Accounts for all page sizes. Will result in an ITLB write from STLB.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x85",
+    "EventName": "ITLB_MISSES.STLB_HIT",
+    "SampleAfterValue": "2000003",
+    "UMask": "0x20",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Instruction fetch requests that miss the ITLB and hit the STLB.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x11",
+    "EventName": "ITLB_MISSES.STLB_HIT",
+    "PublicDescription": "Counts instruction fetch requests that miss the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).",
+    "SampleAfterValue": "100003",
+    "UMask": "0x120",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of first level TLB misses but second level hits due to an instruction fetch that did not start a page walk. Accounts for all page sizes.
Will result in an ITLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.STLB_HIT", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for code (instruction fetch) request.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a code (instruction fetch) request= .", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to any page size.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to any page size. Include= s page walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0xe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (All page sizes)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts completed page walks (all page sizes)= caused by a code fetch. This implies it missed in the ITLB (Instruction TL= B) and further levels of TLB. The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0xe", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to any page size.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to any page size. Include= s page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0xe", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includ= es page walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (2M/4M)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M page size= s) caused by a code fetch. This implies it missed in the ITLB (Instruction = TLB) and further levels of TLB. 
The page walk can end with or without a fau= lt.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includ= es page walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes pag= e walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (4K)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K page sizes) = caused by a code fetch. This implies it missed in the ITLB (Instruction TLB= ) and further levels of TLB. The page walk can end with or without a fault.= ", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes pag= e walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of page walks outstanding f= or iside in PMH every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for iside in PMH every cycle. A PMH page walk is outstanding from page wal= k start till PMH becomes idle again (ready to serve next walk). Includes EP= T-walk intervals. 
Walks could be counted by edge detecting on this event, but would count restarted suspended walks.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Number of page walks outstanding for an outstanding code request in the PMH each cycle.",
+    "Counter": "0,1,2,3,4,5,6,7,8,9",
+    "EventCode": "0x11",
+    "EventName": "ITLB_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for an outstanding code (instruction fetch) request in the PMH (Page Miss Handler) each cycle.",
+    "SampleAfterValue": "100003",
+    "UMask": "0x10",
+    "Unit": "cpu_core"
+  },
+  {
+    "BriefDescription": "Counts the number of page walks outstanding for iside in PMH every cycle.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x85",
+    "EventName": "ITLB_MISSES.WALK_PENDING",
+    "PublicDescription": "Counts the number of page walks outstanding for iside in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals. Walks could be counted by edge detecting on this event, but would count restarted suspended walks.",
+    "SampleAfterValue": "200003",
+    "UMask": "0x10",
+    "Unit": "cpu_lowpower"
+  },
+  {
+    "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DTLB miss.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x05",
+    "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
+    "SampleAfterValue": "1000003",
+    "UMask": "0x90",
+    "Unit": "cpu_atom"
+  },
+  {
+    "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DTLB miss.",
+    "Counter": "0,1,2,3,4,5,6,7",
+    "EventCode": "0x05",
+    "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
+    "SampleAfterValue": "1000003",
+    "UMask": "0x90",
+    "Unit": "cpu_lowpower"
+  }
+]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 2fa5529ee585..154402fa1d39 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -1,6 +1,7 @@
 Family-model,Version,Filename,EventType
 GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core
 GenuineIntel-6-BE,v1.28,alderlaken,core
+GenuineIntel-6-C[56],v1.05,arrowlake,core
 GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
 GenuineIntel-6-(3D|47),v29,broadwell,core
 GenuineIntel-6-56,v11,broadwellde,core
--
2.47.1.545.g3c1d2e2a6a-goog

From nobody Fri Dec 27 09:48:58 2024
Date: Mon, 9 Dec 2024 14:27:41 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-5-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 04/22] perf vendor events: Update Broadwell events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin ,
Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v29 to v30. Update TMA metrics from 4.8 to 5.01. Bring in the event updates v30: https://github.com/intel/perfmon/commit/9a1827b2ac3927a455ae7df5aa3d1e1b10e= 69f15 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/broadwell/bdw-metrics.json | 312 ++++++++++-------- .../pmu-events/arch/x86/broadwell/cache.json | 10 +- .../arch/x86/broadwell/frontend.json | 4 +- .../pmu-events/arch/x86/broadwell/memory.json | 8 +- .../arch/x86/broadwell/metricgroups.json | 5 + .../arch/x86/broadwell/pipeline.json | 10 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 7 files changed, 192 insertions(+), 159 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json b/to= ols/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json index af620553f958..40970fa5566c 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -92,8 +92,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -104,7 +104,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -114,7 +114,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -125,7 +125,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -133,8 +133,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -143,8 +143,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -152,7 +152,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. 
Rel= ated metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches", "ScaleUnit": "100%" }, @@ -160,10 +160,10 @@ "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= ", "ScaleUnit": "100%" }, { @@ -174,7 +174,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -183,8 +183,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -192,8 +192,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.FPU_DIV_ACTIVE", "ScaleUnit": "100%" }, { @@ -202,8 +202,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -212,7 +212,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -220,45 +220,45 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM = / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Rela= ted metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). 
Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -287,7 +287,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -295,8 +295,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -304,8 +304,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -313,8 +313,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -322,8 +322,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -333,33 +333,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. 
This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -370,7 +370,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -391,11 +391,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -414,7 +414,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -432,7 +438,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -440,7 +446,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -448,7 +454,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -456,7 +462,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -464,7 +470,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -506,7 +512,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -529,7 +535,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -541,7 +547,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -583,7 +589,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -602,7 +608,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -628,26 +634,26 @@ }, { "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu= @DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu@DTLB_STORE_MISSES.WALK= _DURATION\\,cmask\\=3D1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOA= D_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core= _clks", + "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + c= pu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + cpu@DTLB_STORE_MISSES.= WALK_DURATION\\,cmask\\=3D0x1@ + 7 
* (DTLB_STORE_MISSES.WALK_COMPLETED + DT= LB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_cor= e_core_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -665,14 +671,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -682,13 +688,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -697,6 +704,19 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -709,6 +729,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -716,7 +743,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -725,14 +752,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -758,24 +786,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. 
Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -783,8 +811,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -793,8 +821,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -803,8 +831,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -812,18 +840,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. 
While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -840,10 +868,10 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -854,15 +882,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -871,7 +899,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -883,7 +911,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -900,8 +928,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", "ScaleUnit": "100%" }, { @@ -910,7 +938,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -918,8 +946,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -928,7 +956,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -937,7 +965,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -973,7 +1001,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -982,7 +1010,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1000,43 +1028,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. 
walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1055,8 +1083,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1064,16 +1092,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1082,8 +1110,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1091,18 +1119,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1118,7 +1146,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1127,8 +1155,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/broadwell/cache.json b/tools/pe= rf/pmu-events/arch/x86/broadwell/cache.json index 063ec8c2b2a1..fd3df87318c5 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. Note: In the L1D, a Deman= d Read contains cacheable or noncacheable demand loads, including ones caus= ing cache-line splits and reads due to page walks resulted from any request= type.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -401,7 +401,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready.\nNote: Only two data= -sources of L1/FB are applicable for AVX-256bit even though the correspond= ing AVX load could be serviced by a deeper level in the memory hierarchy. 
D= ata source is reported for the Low-half load.", + "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready. Note: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load.", "SampleAfterValue": "100003", "UMask": "0x40" }, @@ -412,7 +412,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache.\nNote: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load. This event also counts SW pref= etches independent of the actual data source.", + "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache. Note: Only two data-s= ources of L1/FB are applicable for AVX-256bit even though the correspondin= g AVX load could be serviced by a deeper level in the memory hierarchy. Dat= a source is reported for the Low-half load. This event also counts SW prefe= tches independent of the actual data source.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -601,7 +601,7 @@ "Counter": "0,1,2,3", "EventCode": "0xb2", "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL", - "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full.\nNote: Writeback pending FIFO has s= ix entries.", + "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full. Note: Writeback pending FIFO has si= x entries.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -664,7 +664,7 @@ "Errata": "BDM76", "EventCode": "0x60", "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", - "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is coun= ted from the promotion point.", + "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS. 
Note: A prefetch promoted to Demand is count= ed from the promotion point.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwell/frontend.json b/tools= /perf/pmu-events/arch/x86/broadwell/frontend.json index db3488abf9fc..018020a51436 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/frontend.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/frontend.json @@ -12,7 +12,7 @@ "Counter": "0,1,2,3", "EventCode": "0xAB", "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", - "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) = to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths.= Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode S= tream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer = (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six = cycles in which no uops are delivered to the IDQ. Most often, such switches= from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.= ", + "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) t= o merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. = Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode St= ream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (D= SB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cy= cles in which no uops are delivered to the IDQ. Most often, such switches f= rom the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -212,7 +212,7 @@ "Counter": "0,1,2,3", "EventCode": "0x9C", "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE", - "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation = Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (R= AT) is stalled for the thread (including uop drops and clear BE conditions)= ; \n c. 
Instruction Decode Queue (IDQ) delivers four uops.", + "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation T= able (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT= ) is stalled for the thread (including uop drops and clear BE conditions); = c. Instruction Decode Queue (IDQ) delivers four uops.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwell/memory.json b/tools/p= erf/pmu-events/arch/x86/broadwell/memory.json index 77fbfe99a522..bbc5d7deea6b 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/memory.json @@ -68,7 +68,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc8", "EventName": "HLE_RETIRED.START", - "PublicDescription": "Number of times we entered an HLE region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an HLE region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -77,7 +77,7 @@ "Counter": "0,1,2,3", "EventCode": "0xC3", "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", - "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3= . cross SMT-HW-thread snoop (stores) hitting load buffer.", + "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following: 1. memory disambiguation, 2. external snoop, or 3. 
c= ross SMT-HW-thread snoop (stores) hitting load buffer.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -2226,7 +2226,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "2", + "PEBS": "1", "PublicDescription": "Number of times RTM abort was triggered .", "SampleAfterValue": "2000003", "UMask": "0x4" @@ -2290,7 +2290,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.START", - "PublicDescription": "Number of times we entered an RTM region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an RTM region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json b/t= ools/perf/pmu-events/arch/x86/broadwell/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the 
issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json b/tools= /perf/pmu-events/arch/x86/broadwell/pipeline.json index c03f77539362..962cd07eb307 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json @@ -379,7 +379,7 @@ "BriefDescription": "Reference cycles when the core is not in halt= state.", "Counter": "Fixed counter 2", "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. \nNote: On all current platforms this event stops counting during 'thr= ottling (TM)' states duty off periods the processor is 'halted'. This even= t is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is= done at a lower clock rate then the core clock the overflow status bit for= this counter may appear 'sticky'. After the counter has overflowed and so= ftware clears the overflow status bit and resets the counter to less than M= AX. The reset value to the counter is not clocked immediately so the overfl= ow status bit will flip 'high (1)' and generate another PMI (if enabled) af= ter which the reset value gets clocked into the counter. Therefore, softwar= e will get the interrupt, read the overflow status bit '1 for bit 34 while = the counter value is less than MAX. Software should ignore this case.", + "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. Note: On all current platforms this event stops counting during 'thro= ttling (TM)' states duty off periods the processor is 'halted'. This event= is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. 
The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", "SampleAfterValue": "2000003", "UMask": "0x3" }, @@ -579,7 +579,7 @@ "BriefDescription": "Instructions retired from execution.", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed coun= ter, leaving the four (eight when Hyperthreading is disabled) programmable = counters available for other events. INST_RETIRED.ANY_P is counted by a pro= grammable counter and it is an architectural performance event. \nCounting:= Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as ret= ired instructions.", + "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed count= er, leaving the four (eight when Hyperthreading is disabled) programmable c= ounters available for other events. INST_RETIRED.ANY_P is counted by a prog= rammable counter and it is an architectural performance event. Counting: F= aulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retir= ed instructions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -654,7 +654,7 @@ "Counter": "0,1,2,3", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when:\n - preceding store conflicts with the load (= incomplete overlap);\n - store forwarding is impossible due to u-arch limit= ations;\n - preceding lock RMW operations are not forwarded;\n - store has = the no-forward bit set (uncacheable/page-split/masked stores);\n - all-bloc= king stores are used (mostly, fences and port I/O);\nand others.\nThe most = common case is a load blocked due to its address range overlapping with a p= receding smaller uncompleted store. Note: This event does not take into acc= ount cases of out-of-SW-control (for example, SbTailHit), unknown physical = STA, and cases of blocking loads on store due to being non-WB memory type o= r a lock. These cases are covered by other events.\nSee the table of not su= pported store forwards in the Optimization Guide.", + "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. 
This includes cases when: - preceding store conflicts with the load (i= ncomplete overlap); - store forwarding is impossible due to u-arch limitat= ions; - preceding lock RMW operations are not forwarded; - store has the = no-forward bit set (uncacheable/page-split/masked stores); - all-blocking = stores are used (mostly, fences and port I/O); and others. The most common = case is a load blocked due to its address range overlapping with a precedin= g smaller uncompleted store. Note: This event does not take into account ca= ses of out-of-SW-control (for example, SbTailHit), unknown physical STA, an= d cases of blocking loads on store due to being non-WB memory type or a loc= k. These cases are covered by other events. See the table of not supported = store forwards in the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -822,7 +822,7 @@ "Counter": "0,1,2,3", "EventCode": "0x5E", "EventName": "RS_EVENTS.EMPTY_CYCLES", - "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread.\nNote: In ST-mode, not acti= ve thread should drive 0. This is usually caused by severely costly branch = mispredictions, or allocator/FE issues.", + "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread. Note: In ST-mode, not activ= e thread should drive 0. This is usually caused by severely costly branch m= ispredictions, or allocator/FE issues.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -1177,7 +1177,7 @@ "Counter": "0,1,2,3", "EventCode": "0x0E", "EventName": "UOPS_ISSUED.FLAGS_MERGE", - "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive\n added by GSR u-arch.", + "PublicDescription": "Number of flags-merge uops being allocated. 
Such uops considered perf sensitive added by GSR u-arch.",
        "SampleAfterValue": "2000003",
        "UMask": "0x10"
    },
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 154402fa1d39..01f6a4a82ed2 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -3,7 +3,7 @@ GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core
 GenuineIntel-6-BE,v1.28,alderlaken,core
 GenuineIntel-6-C[56],v1.05,arrowlake,core
 GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
-GenuineIntel-6-(3D|47),v29,broadwell,core
+GenuineIntel-6-(3D|47),v30,broadwell,core
 GenuineIntel-6-56,v11,broadwellde,core
 GenuineIntel-6-4F,v22,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
--=20
2.47.1.545.g3c1d2e2a6a-goog

From nobody Fri Dec 27 09:48:58 2024
Date: Mon, 9 Dec 2024 14:27:42 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-6-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: 
List-Subscribe: 
List-Unsubscribe: 
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 05/22] perf vendor events: Update BroadwellDE events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v11 to v12. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v12:
https://github.com/intel/perfmon/commit/e0b83388d545e527933031ddb2a1d22d65040de1

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/broadwellde/bdwde-metrics.json   | 256 ++++++++++--------
 .../arch/x86/broadwellde/cache.json           |  10 +-
 .../arch/x86/broadwellde/frontend.json        |   4 +-
 .../arch/x86/broadwellde/memory.json          |   6 +-
 .../arch/x86/broadwellde/metricgroups.json    |   5 +
 .../arch/x86/broadwellde/pipeline.json        |  10 +-
 .../arch/x86/broadwellde/uncore-cache.json    |  28 +-
 .../x86/broadwellde/uncore-interconnect.json  |  16 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 9 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
index 2e1380248684..b03a5f2bcd82 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
@@ -74,7 +74,7 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_4k_aliasing > 0.2) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
         "ScaleUnit": "100%"
     },
@@ -84,7 +84,7 @@
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
-        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "MetricThreshold": "(tma_alu_op_utilization > 0.4)",
         "ScaleUnit": "100%"
     },
     {
@@ -92,7 +92,7 @@
         "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "(tma_assists > 0.1) & ((tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1)))",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -102,7 +102,7 @@
         "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)",
         "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
-        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_backend_bound > 0.2)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
@@ -112,7 +112,7 @@
         "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_thread_slots",
         "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
-        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_bad_speculation > 0.15)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
@@ -123,7 +123,7 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
         "MetricName": "tma_branch_mispredicts",
-        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_branch_mispredicts > 0.1) & ((tma_bad_speculation > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
@@ -133,7 +133,7 @@
         "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
         "ScaleUnit": "100%"
     },
@@ -143,7 +143,7 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "(tma_cisc > 0.1) & ((tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1)))",
         "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
         "ScaleUnit": "100%"
     },
@@ -152,7 +152,7 @@
         "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "(tma_clears_resteers > 0.05) & ((tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -160,9 +160,9 @@
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_contested_accesses > 0.05) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -172,7 +172,7 @@
         "MetricExpr": "tma_backend_bound - tma_memory_bound",
         "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_core_bound",
-        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
         "ScaleUnit": "100%"
@@ -183,7 +183,7 @@
         "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_data_sharing > 0.05) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -192,8 +192,8 @@
         "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
+        "MetricThreshold": "(tma_divider > 0.2) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
@@ -202,8 +202,8 @@
         "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "(tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -211,7 +211,7 @@
         "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
-        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_dsb > 0.15) & ((tma_fetch_bandwidth > 0.2))",
         "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
         "ScaleUnit": "100%"
     },
@@ -220,7 +220,7 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_dsb_switches > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
@@ -229,7 +229,7 @@
         "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_dtlb_load > 0.1) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store",
         "ScaleUnit": "100%"
     },
@@ -238,7 +238,7 @@
         "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_dtlb_store > 0.05) & ((tma_store_bound > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load",
         "ScaleUnit": "100%"
     },
@@ -246,9 +246,9 @@
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
-        "MetricThreshold": "tma_fb_full > 0.3",
+        "MetricThreshold": "(tma_fb_full > 0.3)",
         "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
@@ -257,9 +257,9 @@
         "MetricExpr": "tma_frontend_bound - tma_fetch_latency",
         "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
         "MetricName": "tma_fetch_bandwidth",
-        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_fetch_bandwidth > 0.2)",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -267,7 +267,7 @@
         "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_thread_slots",
         "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
         "MetricName": "tma_fetch_latency",
-        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricThreshold": "(tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
         "ScaleUnit": "100%"
@@ -277,7 +277,7 @@
         "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
-        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "MetricThreshold": "(tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))",
         "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
         "ScaleUnit": "100%"
     },
@@ -286,7 +286,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "(tma_fp_scalar > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6)))",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -295,7 +295,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "(tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6)))",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -304,8 +304,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_fp_vector_128b > 0.1) & ((tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -313,8 +313,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_fp_vector_256b > 0.1) & ((tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -322,7 +322,7 @@
         "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_thread_slots",
         "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
-        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricThreshold": "(tma_frontend_bound > 0.15)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
@@ -332,9 +332,9 @@
         "MetricExpr": "tma_microcode_sequencer",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
-        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricThreshold": "(tma_heavy_operations > 0.1)",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
     {
@@ -342,23 +342,23 @@
         "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_icache_misses > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "(tma_info_bad_spec_ipmisp_indirect < 1000)"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;BadSpec;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmispredict",
-        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
+        "MetricThreshold": "(tma_info_bad_spec_ipmispredict < 200)"
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
@@ -396,7 +396,7 @@
         "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)",
         "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_frontend_dsb_coverage",
-        "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_info_thread_ipc / 4 > 0.35",
+        "MetricThreshold": "(tma_info_frontend_dsb_coverage < 0.7) & ((tma_info_thread_ipc / 4) > 0.35)",
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
@@ -405,6 +405,12 @@
         "MetricGroup": "Fed",
         "MetricName": "tma_info_frontend_ipunknown_branch"
     },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
     {
         "BriefDescription": "Branch instructions per taken branch.",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
@@ -423,7 +429,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
-        "MetricThreshold": "tma_info_inst_mix_iparith < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith < 10)",
         "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
     },
     {
@@ -431,7 +437,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
-        "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_avx128 < 10)",
         "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -439,7 +445,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
-        "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_avx256 < 10)",
         "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -447,7 +453,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
-        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_dp < 10)",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -455,7 +461,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
-        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_sp < 10)",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -463,42 +469,42 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Branches;Fed;InsType",
         "MetricName": "tma_info_inst_mix_ipbranch",
-        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8"
+        "MetricThreshold": "(tma_info_inst_mix_ipbranch < 8)"
     },
     {
         "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_ipcall",
-        "MetricThreshold": "tma_info_inst_mix_ipcall < 200"
+        "MetricThreshold": "(tma_info_inst_mix_ipcall < 200)"
     },
     {
         "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_ipflop",
-        "MetricThreshold": "tma_info_inst_mix_ipflop < 10"
+        "MetricThreshold": "(tma_info_inst_mix_ipflop < 10)"
     },
     {
         "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipload",
-        "MetricThreshold": "tma_info_inst_mix_ipload < 3"
+        "MetricThreshold": "(tma_info_inst_mix_ipload < 3)"
     },
     {
         "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipstore",
-        "MetricThreshold": "tma_info_inst_mix_ipstore < 8"
+        "MetricThreshold": "(tma_info_inst_mix_ipstore < 8)"
     },
     {
         "BriefDescription": "Instructions per taken branch",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
+        "MetricThreshold": "(tma_info_inst_mix_iptb < 4 * 2 + 1)",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -521,7 +527,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -533,7 +539,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -575,7 +581,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -594,7 +600,7 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
        "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
     {
@@ -623,10 +629,10 @@
         "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core_clks",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_page_walks_utilization",
-        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
+        "MetricThreshold": "(tma_info_memory_tlb_page_walks_utilization > 0.5)"
     },
     {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
+        "BriefDescription": "",
         "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
     },
@@ -639,7 +645,7 @@
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -657,14 +663,14 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
         "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
@@ -674,7 +680,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "(tma_info_system_ipfarbranch < 1000000)"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
@@ -687,7 +693,20 @@
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "OS",
         "MetricName": "tma_info_system_kernel_utilization",
-        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
+        "MetricThreshold": "(tma_info_system_kernel_utilization > 0.05)"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_mux < 0.9))"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -701,6 +720,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "(tma_info_system_time < 1)"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -743,31 +769,31 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;Ret;Retire",
         "MetricName": "tma_info_thread_uoppi",
-        "MetricThreshold": "tma_info_thread_uoppi > 1.05"
+        "MetricThreshold": "(tma_info_thread_uoppi > 1.05)"
     },
     {
         "BriefDescription": "Uops per taken branch",
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "(tma_info_thread_uptb < 4 * 1.5)"
    },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_itlb_misses > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "(tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
@@ -775,8 +801,8 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "(tma_l2_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -785,7 +811,7 @@
         "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "(tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
         "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
         "ScaleUnit": "100%"
     },
@@ -795,7 +821,7 @@
         "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_l3_hit_latency > 0.1) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency",
         "ScaleUnit": "100%"
     },
@@ -804,7 +830,7 @@
         "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_lcp > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
@@ -813,7 +839,7 @@
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
-        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricThreshold": "(tma_light_operations > 0.6)",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
@@ -824,7 +850,7 @@
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_load_op_utilization",
-        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "MetricThreshold": "(tma_load_op_utilization > 0.6)",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
         "ScaleUnit": "100%"
     },
@@ -832,9 +858,9 @@
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_lock_latency > 0.2) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -844,7 +870,7 @@
         "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts",
         "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
         "MetricName": "tma_machine_clears",
-        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_machine_clears > 0.1) & ((tma_bad_speculation > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
@@ -852,9 +878,9 @@
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_mem_bandwidth > 0.2) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
@@ -863,7 +889,7 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_mem_latency > 0.1) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
@@ -873,7 +899,7 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound",
         "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
-        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
         "ScaleUnit": "100%"
@@ -883,7 +909,7 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_thread_slots",
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
-        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "MetricThreshold": "(tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1))",
         "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -892,7 +918,7 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "(tma_mispredicts_resteers > 0.05) & ((tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
@@ -901,7 +927,7 @@
         "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_mite",
-        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_mite > 0.1) & ((tma_fetch_bandwidth > 0.2))",
         "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
         "ScaleUnit": "100%"
     },
@@ -910,7 +936,7 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_ms_switches > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
@@ -919,8 +945,8 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_core_clks",
         "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_0",
-        "MetricThreshold": "tma_port_0 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_port_0 > 0.6)",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -928,8 +954,8 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_1",
-        "MetricThreshold": "tma_port_1 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_port_1 > 0.6)",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -937,7 +963,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group",
         "MetricName": "tma_port_2",
-        "MetricThreshold": "tma_port_2 > 0.6",
+        "MetricThreshold": "(tma_port_2 > 0.6)",
         "ScaleUnit": "100%"
     },
     {
@@ -945,7 +971,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group",
         "MetricName": "tma_port_3",
-        "MetricThreshold": "tma_port_3 > 0.6",
+        "MetricThreshold": "(tma_port_3 > 0.6)",
         "ScaleUnit": "100%"
     },
     {
@@ -953,7 +979,7 @@
         "MetricExpr": "tma_store_op_utilization",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group",
         "MetricName": "tma_port_4",
-        "MetricThreshold": "tma_port_4 > 0.6",
+        "MetricThreshold": "(tma_port_4 > 0.6)",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Related metrics: tma_split_stores",
         "ScaleUnit": "100%"
     },
@@ -962,7 +988,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_5",
-        "MetricThreshold": "tma_port_5 > 0.6",
+        "MetricThreshold": "(tma_port_5 > 0.6)",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -971,8 +997,8 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
-        "MetricThreshold": "tma_port_6 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_port_6 > 0.6)",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU).
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, = tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5,= tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -980,7 +1006,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_gr= oup", "MetricName": "tma_port_7", - "MetricThreshold": "tma_port_7 > 0.6", + "MetricThreshold": "(tma_port_7 > 0.6)", "ScaleUnit": "100%" }, { @@ -989,7 +1015,7 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "(tma_ports_utilization > 0.15) & ((tma_core_bo= und > 0.1) & ((tma_backend_bound > 0.2)))", "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, @@ -998,7 +1024,7 @@ "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_0 > 0.2) & ((tma_ports_uti= lization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, @@ -1007,7 +1033,7 @@ "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_1 > 0.2) & ((tma_ports_uti= lization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
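The tma_port_N entries in this run all share one shape: uops dispatched on a given execution port divided by core clocks, flagged once a single port is busy in more than 60% of cycles. A minimal sketch of that computation, assuming hypothetical counter values (perf derives this directly from the MetricExpr strings; nothing here is the tool's actual code):

# Illustrative sketch: per-port utilization as defined by the
# tma_port_* MetricExpr/MetricThreshold pairs above.
core_clks = 1_000_000              # hypothetical tma_info_core_core_clks
uops_dispatched = {                # hypothetical UOPS_DISPATCHED_PORT.PORT_n
    0: 700_000, 1: 400_000, 5: 650_000, 6: 300_000,
}
for port, uops in sorted(uops_dispatched.items()):
    util = uops / core_clks
    if util > 0.6:                 # the "(tma_port_N > 0.6)" threshold
        print(f"tma_port_{port} = {util:.2f} (above threshold)")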
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1016,7 +1042,7 @@ "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_2 > 0.15) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1025,7 +1051,7 @@ "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_3m > 0.4) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, @@ -1034,7 +1060,7 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_thread_slots", "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricThreshold": "((tma_retiring > 0.7)|(tma_heavy_operations > = 0.1))", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. 
For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" @@ -1045,7 +1071,7 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_split_loads > 0.3)", "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, @@ -1054,16 +1080,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_split_stores > 0.2) & ((tma_store_bound >= 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_sq_full > 0.3) & ((tma_l3_bound > 0.05) &= ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1072,7 +1098,7 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "MetricThreshold": "(tma_store_bound > 0.2) & ((tma_memory_bound >= 0.2) & ((tma_backend_bound > 0.2)))", "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, @@ -1081,7 +1107,7 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_store_fwd_blk > 0.1) & ((tma_l1_bound > 0= .1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, @@ -1089,9 +1115,9 @@ "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_store_latency > 0.1) & ((tma_store_bound = > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, @@ -1100,7 +1126,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", - "MetricThreshold": "tma_store_op_utilization > 0.6", + "MetricThreshold": "(tma_store_op_utilization > 0.6)", "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. 
Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, @@ -1109,7 +1135,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "(tma_unknown_branches > 0.05) & ((tma_branch_r= esteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)= )))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -1118,7 +1144,7 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "MetricThreshold": "(tma_x87_use > 0.1) & ((tma_fp_arith > 0.2) & = ((tma_light_operations > 0.6)))", "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/cache.json b/tools/= perf/pmu-events/arch/x86/broadwellde/cache.json index 315d7f041731..49d8de8f1b51 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. 
Note: In the L1D, a Deman= d Read contains cacheable or noncacheable demand loads, including ones caus= ing cache-line splits and reads due to page walks resulted from any request= type.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -434,7 +434,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready.\nNote: Only two data= -sources of L1/FB are applicable for AVX-256bit even though the correspond= ing AVX load could be serviced by a deeper level in the memory hierarchy. D= ata source is reported for the Low-half load.", + "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready. Note: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load.", "SampleAfterValue": "100003", "UMask": "0x40" }, @@ -445,7 +445,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache.\nNote: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load. This event also counts SW pref= etches independent of the actual data source.", + "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache. Note: Only two data-s= ources of L1/FB are applicable for AVX-256bit even though the correspondin= g AVX load could be serviced by a deeper level in the memory hierarchy. Dat= a source is reported for the Low-half load. This event also counts SW prefe= tches independent of the actual data source.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -634,7 +634,7 @@ "Counter": "0,1,2,3", "EventCode": "0xb2", "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL", - "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full.\nNote: Writeback pending FIFO has s= ix entries.", + "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full. Note: Writeback pending FIFO has si= x entries.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -697,7 +697,7 @@ "Errata": "BDM76", "EventCode": "0x60", "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", - "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. 
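Most of the broadwellde description changes in this hunk and the ones below are mechanical: literal \n escapes embedded in PublicDescription strings are flattened into single spaces so each description reads as one paragraph. A sketch of that normalization on a toy event (an assumption about the shape of the transformation, not the actual intel/perfmon tooling):

# Illustrative sketch: flatten embedded newlines in a
# PublicDescription string, as these hunks do.
import json
import re

def flatten_description(desc: str) -> str:
    # Collapse newlines and any resulting runs of whitespace.
    return re.sub(r"\s+", " ", desc.replace("\n", " ")).strip()

event = {"PublicDescription": "Counts X.\nNote: caveat Y."}
event["PublicDescription"] = flatten_description(event["PublicDescription"])
print(json.dumps(event))  # {"PublicDescription": "Counts X. Note: caveat Y."}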
See the corresponding= Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is coun= ted from the promotion point.", + "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS. Note: A prefetch promoted to Demand is count= ed from the promotion point.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json b/too= ls/perf/pmu-events/arch/x86/broadwellde/frontend.json index db3488abf9fc..018020a51436 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json @@ -12,7 +12,7 @@ "Counter": "0,1,2,3", "EventCode": "0xAB", "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", - "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) = to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths.= Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode S= tream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer = (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six = cycles in which no uops are delivered to the IDQ. Most often, such switches= from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.= ", + "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) t= o merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. = Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode St= ream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (D= SB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cy= cles in which no uops are delivered to the IDQ. Most often, such switches f= rom the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -212,7 +212,7 @@ "Counter": "0,1,2,3", "EventCode": "0x9C", "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE", - "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when:\n a. 
IDQ-Resource Allocation = Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (R= AT) is stalled for the thread (including uop drops and clear BE conditions)= ; \n c. Instruction Decode Queue (IDQ) delivers four uops.", + "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation T= able (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT= ) is stalled for the thread (including uop drops and clear BE conditions); = c. Instruction Decode Queue (IDQ) delivers four uops.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/memory.json b/tools= /perf/pmu-events/arch/x86/broadwellde/memory.json index 31a74eed2f7d..9fca9a4eeb48 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/memory.json @@ -68,7 +68,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc8", "EventName": "HLE_RETIRED.START", - "PublicDescription": "Number of times we entered an HLE region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an HLE region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -77,7 +77,7 @@ "Counter": "0,1,2,3", "EventCode": "0xC3", "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", - "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3= . cross SMT-HW-thread snoop (stores) hitting load buffer.", + "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following: 1. memory disambiguation, 2. external snoop, or 3. 
c= ross SMT-HW-thread snoop (stores) hitting load buffer.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -280,7 +280,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.START", - "PublicDescription": "Number of times we entered an RTM region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an RTM region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json b= /tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", 
"tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json b/too= ls/perf/pmu-events/arch/x86/broadwellde/pipeline.json index c03f77539362..962cd07eb307 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json @@ -379,7 +379,7 @@ "BriefDescription": "Reference cycles when the core is not in halt= state.", "Counter": "Fixed counter 2", "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. \nNote: On all current platforms this event stops counting during 'thr= ottling (TM)' states duty off periods the processor is 'halted'. This even= t is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is= done at a lower clock rate then the core clock the overflow status bit for= this counter may appear 'sticky'. After the counter has overflowed and so= ftware clears the overflow status bit and resets the counter to less than M= AX. The reset value to the counter is not clocked immediately so the overfl= ow status bit will flip 'high (1)' and generate another PMI (if enabled) af= ter which the reset value gets clocked into the counter. Therefore, softwar= e will get the interrupt, read the overflow status bit '1 for bit 34 while = the counter value is less than MAX. Software should ignore this case.", + "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. Note: On all current platforms this event stops counting during 'thro= ttling (TM)' states duty off periods the processor is 'halted'. This event= is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. 
Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", "SampleAfterValue": "2000003", "UMask": "0x3" }, @@ -579,7 +579,7 @@ "BriefDescription": "Instructions retired from execution.", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed coun= ter, leaving the four (eight when Hyperthreading is disabled) programmable = counters available for other events. INST_RETIRED.ANY_P is counted by a pro= grammable counter and it is an architectural performance event. \nCounting:= Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as ret= ired instructions.", + "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed count= er, leaving the four (eight when Hyperthreading is disabled) programmable c= ounters available for other events. INST_RETIRED.ANY_P is counted by a prog= rammable counter and it is an architectural performance event. Counting: F= aulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retir= ed instructions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -654,7 +654,7 @@ "Counter": "0,1,2,3", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when:\n - preceding store conflicts with the load (= incomplete overlap);\n - store forwarding is impossible due to u-arch limit= ations;\n - preceding lock RMW operations are not forwarded;\n - store has = the no-forward bit set (uncacheable/page-split/masked stores);\n - all-bloc= king stores are used (mostly, fences and port I/O);\nand others.\nThe most = common case is a load blocked due to its address range overlapping with a p= receding smaller uncompleted store. Note: This event does not take into acc= ount cases of out-of-SW-control (for example, SbTailHit), unknown physical = STA, and cases of blocking loads on store due to being non-WB memory type o= r a lock. These cases are covered by other events.\nSee the table of not su= pported store forwards in the Optimization Guide.", + "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when: - preceding store conflicts with the load (i= ncomplete overlap); - store forwarding is impossible due to u-arch limitat= ions; - preceding lock RMW operations are not forwarded; - store has the = no-forward bit set (uncacheable/page-split/masked stores); - all-blocking = stores are used (mostly, fences and port I/O); and others. The most common = case is a load blocked due to its address range overlapping with a precedin= g smaller uncompleted store. 
Note: This event does not take into account ca= ses of out-of-SW-control (for example, SbTailHit), unknown physical STA, an= d cases of blocking loads on store due to being non-WB memory type or a loc= k. These cases are covered by other events. See the table of not supported = store forwards in the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -822,7 +822,7 @@ "Counter": "0,1,2,3", "EventCode": "0x5E", "EventName": "RS_EVENTS.EMPTY_CYCLES", - "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread.\nNote: In ST-mode, not acti= ve thread should drive 0. This is usually caused by severely costly branch = mispredictions, or allocator/FE issues.", + "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread. Note: In ST-mode, not activ= e thread should drive 0. This is usually caused by severely costly branch m= ispredictions, or allocator/FE issues.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -1177,7 +1177,7 @@ "Counter": "0,1,2,3", "EventCode": "0x0E", "EventName": "UOPS_ISSUED.FLAGS_MERGE", - "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive\n added by GSR u-arch.", + "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive added by GSR u-arch.", "SampleAfterValue": "2000003", "UMask": "0x10" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json b= /tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json index f5b5ae1150c3..bfc499fdfdeb 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json @@ -608,7 +608,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1654,7 +1654,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1664,7 +1664,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. 
This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2679,7 +2679,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2689,7 +2689,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2699,7 +2699,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2709,7 +2709,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3239,7 +3239,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3249,7 +3249,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3259,7 +3259,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3679,7 +3679,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
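The UNC_H_RPQ/WPQ_CYCLES_NO_REG_CREDITS events count cycles in which the Home Agent holds no regular iMC read/write credits for a channel. One plausible way to consume them, sketched with hypothetical counts, is as a per-channel starvation fraction over uncore clocks (the normalization is an assumption on my part, not something these files define):

# Illustrative sketch: per-channel credit-starvation fraction.
# All counter values are hypothetical.
uncore_clks = 2_000_000
no_rpq_credits = {"CHN0": 10_000, "CHN1": 250_000, "CHN2": 0, "CHN3": 5_000}
for chan, cycles in sorted(no_rpq_credits.items()):
    print(f"{chan}: {cycles / uncore_clks:.1%} of cycles with no RPQ credits")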
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3689,7 +3689,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3699,7 +3699,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3709,7 +3709,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. 
One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect= .json b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json index 58031f397168..5ccc49cb66a6 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json @@ -33,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -43,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -53,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -63,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -73,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -83,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -93,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x8", "Unit": "IRP" }, @@ -103,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x40", "Unit": "IRP" }, diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 01f6a4a82ed2..a442ace16a67 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -4,7 +4,7 @@ GenuineIntel-6-BE,v1.28,alderlaken,core GenuineIntel-6-C[56],v1.05,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core -GenuineIntel-6-56,v11,broadwellde,core 
+GenuineIntel-6-56,v12,broadwellde,core
 GenuineIntel-6-4F,v22,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
-- 
2.47.1.545.g3c1d2e2a6a-goog
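[Editorial aside between patches: the JSON files touched above are not read by perf at runtime; they are compiled into perf's event tables at build time, so the quickest way to sanity-check an update like this after applying it is to rebuild perf and run its self-tests. A minimal sketch, assuming a kernel source checkout and a CPU/perf version that supports the named metric; exact test names vary between perf versions:

    make -C tools/perf                                 # regenerate pmu-events tables and rebuild perf
    ./tools/perf/perf test -v pmu                      # run perf's PMU/JSON event self-tests
    ./tools/perf/perf stat -M tma_frontend_bound -a -- sleep 1   # exercise one of the updated TMA metrics

If the JSON were malformed the build itself fails, and a metric whose events no longer exist shows up as an error from perf stat rather than a silent zero.]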
From nobody Fri Dec 27 09:48:58 2024
Date: Mon, 9 Dec 2024 14:27:43 -0800
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Subject: [PATCH v1 06/22] perf vendor events: Update BroadwellX events/metrics
Message-Id: <20241209222800.296000-7-irogers@google.com>
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>

Update events from v22 to v23. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v23:
https://github.com/intel/perfmon/commit/679982113f4bfa16cee19d5408a7f8e309e3ac23

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/broadwellx/bdx-metrics.json      | 344 ++++++++++--------
 .../pmu-events/arch/x86/broadwellx/cache.json |  10 +-
 .../arch/x86/broadwellx/frontend.json         |   4 +-
 .../arch/x86/broadwellx/memory.json           |   6 +-
 .../arch/x86/broadwellx/metricgroups.json     |   5 +
 .../arch/x86/broadwellx/pipeline.json         |  10 +-
 .../arch/x86/broadwellx/uncore-cache.json     |  28 +-
 .../x86/broadwellx/uncore-interconnect.json   |  36 +-
 .../arch/x86/broadwellx/uncore-memory.json    |   1 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 10 files changed, 230 insertions(+), 216 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
index 0577d7460082..8016202bad1f 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
@@ -55,7 +55,7 @@
         "MetricName": "UNCORE_FREQ"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
         "ScaleUnit": "1per_instr"
@@ -76,24 +76,24 @@
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions",
         "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_store_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x19e@ * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU",
         "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x1c8\\,filter_tid\\=0x3e@ + cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x180\\,filter_tid\\=0x3e@) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -102,14 +102,14 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "itlb_large_page_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "itlb_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -209,13 +209,13 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
@@ -276,12 +276,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
@@ -294,8 +294,8 @@
         "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY_WB_ASSIST",
         "ScaleUnit": "100%"
     },
     {
@@ -306,7 +306,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -316,7 +316,7 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
     {
@@ -327,7 +327,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -335,8 +335,8 @@
         "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -345,8 +345,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -354,7 +354,7 @@
         "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -362,10 +362,10 @@
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -376,7 +376,7 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
@@ -385,8 +385,8 @@
         "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -394,8 +394,8 @@
         "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.FPU_DIV_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
@@ -404,8 +404,8 @@
         "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -414,7 +414,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -422,45 +422,45 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
+        "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
+        "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -489,7 +489,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -497,8 +497,8 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -506,8 +506,8 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -515,8 +515,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -524,8 +524,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -535,33 +535,33 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "tma_microcode_sequencer",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
        "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -572,7 +572,7 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))",
+        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_core_core_clks"
     },
@@ -593,11 +593,11 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -616,7 +616,13 @@
         "MetricName": "tma_info_frontend_ipunknown_branch"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
@@ -634,7 +640,7 @@
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -642,7 +648,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -650,7 +656,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -658,7 +664,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate).
Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -666,7 +672,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -708,7 +714,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -731,7 +737,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -743,7 +749,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -785,7 +791,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -804,7 +810,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -837,19 +843,19 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": 
"UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -867,14 +873,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -884,13 +890,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -901,18 +908,31 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", + "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -925,6 +945,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -933,12 +960,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -947,14 +974,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -980,24 +1008,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. 
Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -1005,8 +1033,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1015,8 +1043,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -1025,8 +1053,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -1034,18 +1062,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. 
While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1063,18 +1091,18 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -1090,10 +1118,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -1102,7 +1130,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -1114,7 +1142,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -1131,8 +1159,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", "ScaleUnit": "100%" }, { @@ -1141,7 +1169,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -1149,8 +1177,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. 
Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -1159,7 +1187,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1168,7 +1196,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -1204,7 +1232,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. 
Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1213,7 +1241,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1231,43 +1259,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1276,8 +1304,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1285,8 +1313,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1305,8 +1333,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1314,16 +1342,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1332,8 +1360,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1341,18 +1369,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1368,7 +1396,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1377,8 +1405,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json b/tools/p= erf/pmu-events/arch/x86/broadwellx/cache.json index beeda41b428a..59bc8c71487e 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. Note: In the L1D, a Deman= d Read contains cacheable or noncacheable demand loads, including ones caus= ing cache-line splits and reads due to page walks resulted from any request= type.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -434,7 +434,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready.\nNote: Only two data= -sources of L1/FB are applicable for AVX-256bit even though the correspond= ing AVX load could be serviced by a deeper level in the memory hierarchy. 
D= ata source is reported for the Low-half load.", + "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready. Note: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load.", "SampleAfterValue": "100003", "UMask": "0x40" }, @@ -445,7 +445,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache.\nNote: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load. This event also counts SW pref= etches independent of the actual data source.", + "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache. Note: Only two data-s= ources of L1/FB are applicable for AVX-256bit even though the correspondin= g AVX load could be serviced by a deeper level in the memory hierarchy. Dat= a source is reported for the Low-half load. This event also counts SW prefe= tches independent of the actual data source.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -634,7 +634,7 @@ "Counter": "0,1,2,3", "EventCode": "0xb2", "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL", - "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full.\nNote: Writeback pending FIFO has s= ix entries.", + "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full. Note: Writeback pending FIFO has si= x entries.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -697,7 +697,7 @@ "Errata": "BDM76", "EventCode": "0x60", "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", - "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is coun= ted from the promotion point.", + "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS. 
Note: A prefetch promoted to Demand is count= ed from the promotion point.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/frontend.json index db3488abf9fc..018020a51436 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json @@ -12,7 +12,7 @@ "Counter": "0,1,2,3", "EventCode": "0xAB", "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", - "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) = to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths.= Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode S= tream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer = (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six = cycles in which no uops are delivered to the IDQ. Most often, such switches= from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.= ", + "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) t= o merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. = Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode St= ream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (D= SB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cy= cles in which no uops are delivered to the IDQ. Most often, such switches f= rom the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -212,7 +212,7 @@ "Counter": "0,1,2,3", "EventCode": "0x9C", "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE", - "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation = Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (R= AT) is stalled for the thread (including uop drops and clear BE conditions)= ; \n c. 
Instruction Decode Queue (IDQ) delivers four uops.", + "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation T= able (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT= ) is stalled for the thread (including uop drops and clear BE conditions); = c. Instruction Decode Queue (IDQ) delivers four uops.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json b/tools/= perf/pmu-events/arch/x86/broadwellx/memory.json index 86246f632d79..093c8b564fc0 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/memory.json @@ -68,7 +68,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc8", "EventName": "HLE_RETIRED.START", - "PublicDescription": "Number of times we entered an HLE region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an HLE region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -77,7 +77,7 @@ "Counter": "0,1,2,3", "EventCode": "0xC3", "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", - "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3= . cross SMT-HW-thread snoop (stores) hitting load buffer.", + "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following: 1. memory disambiguation, 2. external snoop, or 3. 
c= ross SMT-HW-thread snoop (stores) hitting load buffer.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -470,7 +470,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.START", - "PublicDescription": "Number of times we entered an RTM region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an RTM region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json b/= tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", 
"tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/pipeline.json index c03f77539362..962cd07eb307 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json @@ -379,7 +379,7 @@ "BriefDescription": "Reference cycles when the core is not in halt= state.", "Counter": "Fixed counter 2", "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. \nNote: On all current platforms this event stops counting during 'thr= ottling (TM)' states duty off periods the processor is 'halted'. This even= t is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is= done at a lower clock rate then the core clock the overflow status bit for= this counter may appear 'sticky'. After the counter has overflowed and so= ftware clears the overflow status bit and resets the counter to less than M= AX. The reset value to the counter is not clocked immediately so the overfl= ow status bit will flip 'high (1)' and generate another PMI (if enabled) af= ter which the reset value gets clocked into the counter. Therefore, softwar= e will get the interrupt, read the overflow status bit '1 for bit 34 while = the counter value is less than MAX. Software should ignore this case.", + "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. Note: On all current platforms this event stops counting during 'thro= ttling (TM)' states duty off periods the processor is 'halted'. This event= is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. 
Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", "SampleAfterValue": "2000003", "UMask": "0x3" }, @@ -579,7 +579,7 @@ "BriefDescription": "Instructions retired from execution.", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed coun= ter, leaving the four (eight when Hyperthreading is disabled) programmable = counters available for other events. INST_RETIRED.ANY_P is counted by a pro= grammable counter and it is an architectural performance event. \nCounting:= Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as ret= ired instructions.", + "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed count= er, leaving the four (eight when Hyperthreading is disabled) programmable c= ounters available for other events. INST_RETIRED.ANY_P is counted by a prog= rammable counter and it is an architectural performance event. Counting: F= aulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retir= ed instructions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -654,7 +654,7 @@ "Counter": "0,1,2,3", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when:\n - preceding store conflicts with the load (= incomplete overlap);\n - store forwarding is impossible due to u-arch limit= ations;\n - preceding lock RMW operations are not forwarded;\n - store has = the no-forward bit set (uncacheable/page-split/masked stores);\n - all-bloc= king stores are used (mostly, fences and port I/O);\nand others.\nThe most = common case is a load blocked due to its address range overlapping with a p= receding smaller uncompleted store. Note: This event does not take into acc= ount cases of out-of-SW-control (for example, SbTailHit), unknown physical = STA, and cases of blocking loads on store due to being non-WB memory type o= r a lock. These cases are covered by other events.\nSee the table of not su= pported store forwards in the Optimization Guide.", + "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when: - preceding store conflicts with the load (i= ncomplete overlap); - store forwarding is impossible due to u-arch limitat= ions; - preceding lock RMW operations are not forwarded; - store has the = no-forward bit set (uncacheable/page-split/masked stores); - all-blocking = stores are used (mostly, fences and port I/O); and others. The most common = case is a load blocked due to its address range overlapping with a precedin= g smaller uncompleted store. 
Note: This event does not take into account ca= ses of out-of-SW-control (for example, SbTailHit), unknown physical STA, an= d cases of blocking loads on store due to being non-WB memory type or a loc= k. These cases are covered by other events. See the table of not supported = store forwards in the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -822,7 +822,7 @@ "Counter": "0,1,2,3", "EventCode": "0x5E", "EventName": "RS_EVENTS.EMPTY_CYCLES", - "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread.\nNote: In ST-mode, not acti= ve thread should drive 0. This is usually caused by severely costly branch = mispredictions, or allocator/FE issues.", + "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread. Note: In ST-mode, not activ= e thread should drive 0. This is usually caused by severely costly branch m= ispredictions, or allocator/FE issues.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -1177,7 +1177,7 @@ "Counter": "0,1,2,3", "EventCode": "0x0E", "EventName": "UOPS_ISSUED.FLAGS_MERGE", - "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive\n added by GSR u-arch.", + "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive added by GSR u-arch.", "SampleAfterValue": "2000003", "UMask": "0x10" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json b/= tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json index b55b305aecaa..8f6955615bfd 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json @@ -802,7 +802,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1859,7 +1859,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1869,7 +1869,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. 
This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2884,7 +2884,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2894,7 +2894,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2904,7 +2904,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2914,7 +2914,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3448,7 +3448,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3458,7 +3458,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3468,7 +3468,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3888,7 +3888,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3898,7 +3898,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3908,7 +3908,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3918,7 +3918,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. 
One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.= json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json index 765d44012bba..56d729c8e946 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json @@ -1,24 +1,4 @@ [ - { - "BriefDescription": "Number of non data (control) flits transmitte= d . Derived from unc_q_txl_flits_g0.non_data", - "Counter": "0,1,2,3", - "EventName": "QPI_CTL_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL= non-data flits transmitted across QPI. This basically tracks the protocol= overhead on the QPI link. One can get a good picture of the QPI-link char= acteristics by evaluating the protocol flits, data flits, and idle/null fli= ts. This includes the header flits for data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x4", - "Unit": "QPI" - }, - { - "BriefDescription": "Number of data flits transmitted . Derived fr= om unc_q_txl_flits_g0.data", - "Counter": "0,1,2,3", - "EventName": "QPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data fli= ts transmitted over QPI. Each flit contains 64b of data. 
This includes bo= th DRS and NCB data flits (coherent and non-coherent). This can be used to= calculate the data bandwidth of the QPI link. One can get a good picture = of the QPI-link characteristics by evaluating the protocol flits, data flit= s, and idle/null flits. This does not include the header flits that go in = data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x2", - "Unit": "QPI" - }, { "BriefDescription": "Total Write Cache Occupancy; Any Source", "Counter": "0,1", @@ -53,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -63,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -73,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -83,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -93,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -103,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -113,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x8", "Unit": "IRP" }, @@ -123,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x40", "Unit": "IRP" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json index 45555316f8ea..ca09c1286485 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json @@ -184,6 +184,7 @@ { "BriefDescription": "This event is deprecated. 
Refer to new event = UNC_M_CLOCKTICKS_P", "Counter": "0,1,2,3", + "Deprecated": "1", "EventName": "UNC_M_DCLOCKTICKS", "PerPkg": "1", "Unit": "iMC" diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index a442ace16a67..d50317fec5e4 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -5,7 +5,7 @@ GenuineIntel-6-C[56],v1.05,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core GenuineIntel-6-56,v12,broadwellde,core -GenuineIntel-6-4F,v22,broadwellx,core +GenuineIntel-6-4F,v23,broadwellx,core GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core GenuineIntel-6-9[6C],v1.05,elkhartlake,core GenuineIntel-6-CF,v1.09,emeraldrapids,core --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:58 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CC49719E98A for ; Mon, 9 Dec 2024 22:28:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783325; cv=none; b=WE9Qkw8AvpFj68BnAaQIRfDkrFW1DUypr5o/EbQsYrjhYuoamhlYwJVcsxOp8467CVVsSp9IycehZ0x2cpTR2rgb/GOFfbqT29LfmvJTmjapoAAu9M7tdMMZoPhJmHPyAFwO9tLJxGnAdLT34s6Oziz1uwJszQmw69Su+W0QWDI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783325; c=relaxed/simple; bh=unVbfSu09Ki9FYWBWlr3Gbn9tJoqO05ExYZogw1p/yU=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=lnZUjPMmc4DFPMpVdYdKp2ESuxngOg66twweiOsDv+QApXwiGT46OK0XxdBrPHMLFsFK/uDbLzCdVayOSvpcCK3q32+BUdjHHM1zgewn71HaKcSpIi0guKSt4aSTfNzMTfea1cm8SmNIQXur5/vqPsRwkWMeTGBFgqqdz1H1D/E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=akyxz3Hz; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="akyxz3Hz" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6eff377fdc1so43215487b3.2 for ; Mon, 09 Dec 2024 14:28:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1733783305; x=1734388105; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=5D9+dZ/tmJQwpHVkoeVPRoZQH7yu7zSN7SoISBlhe6A=; b=akyxz3HzOi2DKWueBJSNEymmGwSof1wVST32G9WrTfY1c0MgvEwrOV+8WFqPP//bBU 7Flm8PApSkJY/ypFIS4XJjZ9LHMcEqAKxhwckrRKRGe3i+TGE2aDz2k1M7q+dzaxKK2r FPPDUiYtveNa3bmrMEVaa0JUJf4W90c13X25mqhosQT+pMDe4zKlC/WWOR7pnI+3+GFD oQQS2xTcAj0PyjesrB+K9mdUNrS/HDAKBdb1ETmJ0TrvfVI34wI0W3eBKBAG46U+CkRy Rnp4EmxjsDLV03ckecBi5/UdTI9dJK/JiemfQFb879BIzi1Xrsc/HoUes5wolgbVBMdQ kbPA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1733783305; x=1734388105; 
From nobody Fri Dec 27 09:48:58 2024
Date: Mon, 9 Dec 2024 14:27:44 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-8-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 07/22] perf vendor events: Update CascadelakeX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.22 to v1.23. Update TMA metrics from 4.8 to 5.01.
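
(Background for the metric churn below: each clx-metrics.json entry binds a
MetricName to a MetricExpr over raw event names, optionally scaled by
ScaleUnit, and a consumer evaluates the expression after counting the
events. The Python sketch that follows illustrates that evaluation for the
cpi metric touched in the diff below; the counts are made-up values and the
helper is hypothetical, not perf code.)

# Illustrative only; hypothetical counts, not part of this patch.
counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_200_000,  # cycles (assumed sample)
    "INST_RETIRED.ANY": 1_000_000,         # retired instructions (assumed)
}

def cpi(c):
    # Mirrors the cpi MetricExpr in the diff below:
    # "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", ScaleUnit "1per_instr".
    return c["CPU_CLK_UNHALTED.THREAD"] / c["INST_RETIRED.ANY"]

print(f"cpi = {cpi(counts):.2f} cycles per instruction")  # cpi = 1.20
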
Bring in the event updates v1.23:
https://github.com/intel/perfmon/commit/8f3665f6be4688fd1dd1e713ba49ca16ec93b856

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/cascadelakex/clx-metrics.json    | 767 ++++++++++--------
 .../arch/x86/cascadelakex/metricgroups.json   |   9 +-
 .../arch/x86/cascadelakex/uncore-cache.json   |  60 +-
 .../x86/cascadelakex/uncore-interconnect.json |  11 -
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 5 files changed, 464 insertions(+), 385 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
index b02a89e14c5d..5729b93a9c68 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
@@ -55,7 +55,7 @@
         "MetricName": "UNCORE_FREQ"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
         "ScaleUnit": "1per_instr"
     },
@@ -76,31 +76,31 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "dtlb_2mb_large_page_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions",
         "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_store_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU",
         "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -109,14 +109,14 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "itlb_large_page_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "itlb_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -192,25 +192,25 @@
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -240,13 +240,13 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ + cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ + cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
@@ -313,12 +313,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -330,7 +330,7 @@
         "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -341,7 +341,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -351,9 +351,102 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
         "MetricConstraint": "NO_GROUP_EVENTS",
@@ -362,7 +455,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -370,8 +463,8 @@
         "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -379,8 +472,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -388,18 +481,50 @@
         "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + 44 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + (47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -410,25 +535,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -437,18 +562,18 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound - tma_pmm_bound if #has_pmem > 0 else CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound)",
+        "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -457,7 +582,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -465,47 +590,47 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(110 * tma_info_system_core_frequency * (OCR.DEMAND_RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_system_core_frequency * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", "ScaleUnit": "100%" }, { @@ -515,7 +640,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -525,17 +650,17 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -545,7 +670,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -554,7 +679,7 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", "ScaleUnit": "100%" }, { @@ -562,17 +687,17 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xFC@ / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -581,8 +706,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -590,8 +715,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. 
May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -599,7 +724,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -610,50 +735,50 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. 
This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -663,7 +788,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -678,8 +803,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -687,7 +812,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -695,108 +820,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)= ) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full= / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq= _full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_f= b_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latenc= y + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) = + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l= 2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_la= tency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma= _memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_= bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_laten= cy / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_lat= ency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound = + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l1= _hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_= latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. 
Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. 
FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_l= atency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_st= ore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma= _false_sharing + tma_split_stores + tma_store_latency)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) *= tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + t= ma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sha= ring) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bou= nd + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / = (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency = - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_oth= er_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -825,7 +852,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -850,14 +877,14 @@ }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) / (2 * tma_info_core_core_clks= )", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@) / (2 * tma_info_core_core_clks= )", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -870,20 +897,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -913,7 +940,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -928,11 +961,11 @@ { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -940,7 +973,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -948,7 +981,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -956,7 +989,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -964,7 +997,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -972,7 +1005,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1018,7 +1051,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1028,7 +1061,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1075,7 +1108,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1093,7 +1126,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1135,13 +1168,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1160,7 +1193,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -1216,7 +1249,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1237,18 +1270,18 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - 
"MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1266,29 +1299,29 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. 
Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU"
     },
     {
         "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time",
+        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / tma_info_system_time",
         "MetricGroup": "IoBW;MemOffcore;Server;SoC",
         "MetricName": "tma_info_system_io_write_bw",
         "PublicDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]. Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU"
@@ -1298,13 +1331,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -1322,43 +1356,37 @@
     },
     {
         "BriefDescription": "Average number of parallel data read requests to external memory",
-        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@",
+        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD\\,thresh\\=0x1@",
         "MetricGroup": "Mem;MemoryBW;SoC",
         "MetricName": "tma_info_system_mem_parallel_reads",
         "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
     },
-    {
-        "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]",
-        "MetricExpr": "(1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_INSERTS) / imc_0@event\\=0x0@ if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryLat;Server;SoC",
-        "MetricName": "tma_info_system_mem_pmm_read_latency",
-        "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches"
-    },
     {
         "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)",
-        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_time)",
         "MetricGroup": "Mem;MemoryLat;SoC",
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_read_bw"
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_write_bw"
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0",
         "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_clks)",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license0_utilization",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1",
@@ -1366,7 +1394,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license1_utilization",
         "MetricThreshold": "tma_info_system_power_license1_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)",
@@ -1374,7 +1402,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license2_utilization",
         "MetricThreshold": "tma_info_system_power_license2_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -1388,6 +1416,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -1396,12 +1431,12 @@
     },
     {
         "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
-        "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_uncore_frequency"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1410,14 +1445,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1443,43 +1479,52 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
+        "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
     {
@@ -1487,17 +1532,17 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_bottleneck_cache_memory_latency, tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1505,18 +1550,18 @@
         "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -1534,7 +1579,7 @@
         "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_hit",
-        "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1542,24 +1587,48 @@
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_miss",
-        "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory",
-        "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(80 * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_local_mem",
-        "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -1571,16 +1640,16 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
     {
@@ -1588,8 +1657,8 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1600,11 +1669,11 @@
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
         "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_memory_operations",
@@ -1618,7 +1687,7 @@
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
         "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -1626,8 +1695,8 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
     {
@@ -1640,12 +1709,12 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles)",
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
         "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
         "MetricName": "tma_mixing_vectors",
         "MetricThreshold": "tma_mixing_vectors > 0.05",
-        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -1653,8 +1722,8 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
     {
@@ -1663,7 +1732,7 @@
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_non_fused_branches",
         "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
         "ScaleUnit": "100%"
     },
     {
@@ -1671,7 +1740,7 @@
         "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_nop_instructions",
-        "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
         "ScaleUnit": "100%"
     },
@@ -1685,29 +1754,19 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).",
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
         "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
         "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
         "MetricName": "tma_other_mispredicts",
-        "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
         "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
         "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
         "MetricName": "tma_other_nukes",
-        "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)",
-        "ScaleUnit": "100%"
-    },
-    {
-        "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0) if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
-        "MetricName": "tma_pmm_bound",
-        "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module.",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
@@ -1761,7 +1820,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_5",
         "MetricThreshold": "tma_port_5 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1770,7 +1829,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
         "MetricThreshold": "tma_port_6 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1787,8 +1846,8 @@
         "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)",
         "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_ports_utilization",
-        "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
         "ScaleUnit": "100%"
     },
     {
@@ -1796,8 +1855,8 @@
         "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_0",
-        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
         "ScaleUnit": "100%"
     },
     {
@@ -1805,7 +1864,7 @@
         "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CORE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_core_core_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_1",
-        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound",
         "ScaleUnit": "100%"
     },
@@ -1814,35 +1873,35 @@
         "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CORE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_core_core_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_2",
-        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).",
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
         "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks",
         "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_3m",
-        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group",
         "MetricName": "tma_remote_cache",
-        "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory",
-        "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_remote_mem",
-        "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS",
+        "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM",
         "ScaleUnit": "100%"
     },
     {
@@ -1860,7 +1919,7 @@
         "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clks",
         "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
         "MetricName": "tma_serializing_operation",
-        "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -1869,8 +1928,8 @@
         "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_slow_pause",
-        "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: ROB_MISC_EVENTS.PAUSE_INST",
         "ScaleUnit": "100%"
     },
     {
@@ -1879,8 +1938,8 @@
         "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%"
     },
     {
@@ -1888,17 +1947,17 @@
         "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
         "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%"
     },
     {
@@ -1906,8 +1965,8 @@
         "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%"
     },
     {
@@ -1915,18 +1974,18 @@
         "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1942,7 +2001,7 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1950,7 +2009,31 @@
         "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05
& tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1958,7 +2041,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1967,8 +2050,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json index cccfcab3425e..a579603f720b 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json @@ -38,6 +38,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -84,7 +85,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -113,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json index 6347eba48810..c9596e18ec09 100644 --- 
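Side note (not part of the patch): the new TopdownL6 children added above split tma_store_stlb_miss by the page size of the completed walk, so the three children sum back to the parent. A minimal sketch with invented counts, mirroring the MetricExpr shape:

    # Illustrative only; event counts below are made up.
    walks = {"4K": 9000, "2M_4M": 800, "1G": 200}  # DTLB_STORE_MISSES.WALK_COMPLETED_*
    tma_store_stlb_miss = 0.12  # parent metric value (fraction of clocks)

    total = sum(walks.values())
    children = {size: tma_store_stlb_miss * n / total for size, n in walks.items()}

    # By construction the page-size children partition the parent.
    assert abs(sum(children.values()) - tma_store_stlb_miss) < 1e-12
    print(children)  # {'4K': 0.108, '2M_4M': 0.0096, '1G': 0.0024}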
diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json
index 6347eba48810..c9596e18ec09 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json
@@ -4577,7 +4577,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : CRds issued by iA Cores that Hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4588,7 +4588,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
        "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : DRds issued by iA Cores that Hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4599,7 +4599,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4609,7 +4609,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4619,7 +4619,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores that hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4630,7 +4630,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that Hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4651,7 +4651,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : CRds issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4662,7 +4662,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : DRds issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4673,7 +4673,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4683,7 +4683,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4693,7 +4693,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores that missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4704,7 +4704,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4747,7 +4747,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM",
         "Experimental": "1",
-        "Filter": "config1=0x4903300000000",
+        "Filter": "config1=0x49033",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO ItoM requests that miss the LLC. An ItoM request is used by IIO to request a data write without first reading the data for ownership.",
         "UMask": "0x24",
@@ -4759,7 +4759,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RDCUR",
         "Experimental": "1",
-        "Filter": "config1=0x43c3300000000",
+        "Filter": "config1=0x43C33",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO RdCur requests and miss the LLC. A RdCur request is used by IIO to read data without changing state.",
         "UMask": "0x24",
@@ -4771,7 +4771,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RFO",
         "Experimental": "1",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO RFO requests that miss the LLC. A read for ownership (RFO) requests a cache line to be cached in E state with the intent to modify.",
         "UMask": "0x24",
@@ -4999,7 +4999,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -5010,7 +5010,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -5021,7 +5021,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -5031,7 +5031,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -5041,7 +5041,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -5052,7 +5052,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -5073,7 +5073,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -5084,7 +5084,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -5095,7 +5095,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -5105,7 +5105,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -5115,7 +5115,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -5126,7 +5126,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -5171,7 +5171,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_ITOM",
         "Experimental": "1",
-        "Filter": "config1=0x4903300000000",
+        "Filter": "config1=0x49033",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO ItoM requests that miss the LLC. An ItoM is used by IIO to request a data write without first reading the data for ownership.",
         "UMask": "0x24",
@@ -5183,7 +5183,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR",
         "Experimental": "1",
-        "Filter": "config1=0x43c3300000000",
+        "Filter": "config1=0x43C33",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO RdCur requests that miss the LLC. A RdCur request is used by IIO to read data without changing state.",
         "UMask": "0x24",
@@ -5195,7 +5195,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RFO",
         "Experimental": "1",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO RFO requests that miss the LLC. A read for ownership (RFO) requests data to be cached in E state with the intent to modify.",
         "UMask": "0x24",
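Side note (not part of the patch): the Filter fixes above look mechanical, and indeed every corrected config1 value appears to be the old one shifted down by 32 bits, i.e. the opcode-match field now sits in the low bits of config1. A quick sanity check over the values touched in this file:

    # Each (old, new) pair comes from the hunks above.
    pairs = [
        (0x4023300000000, 0x40233),  # CRD
        (0x4043300000000, 0x40433),  # DRD
        (0x4b23300000000, 0x4b233),  # LlcPrefCRD
        (0x4b43300000000, 0x4b433),  # LlcPrefDRD
        (0x4b03300000000, 0x4b033),  # LlcPrefRFO
        (0x4003300000000, 0x40033),  # RFO
        (0x4903300000000, 0x49033),  # IO ItoM
        (0x43c3300000000, 0x43C33),  # IO RdCur
    ]
    for old, new in pairs:
        assert old >> 32 == new  # new value = old value >> 32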
diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
index 91889e447bd1..4efbfeb338a6 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
@@ -13855,16 +13855,5 @@
         "PerPkg": "1",
         "PublicDescription": "Number outstanding register requests within message channel tracker",
         "Unit": "UBOX"
-    },
-    {
-        "BriefDescription": "UPI interconnect send bandwidth for payload. Derived from unc_upi_txl_flits.all_data",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x2",
-        "EventName": "UPI_DATA_BANDWIDTH_TX",
-        "PerPkg": "1",
-        "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.",
-        "ScaleUnit": "7.11E-06Bytes",
-        "UMask": "0xf",
-        "Unit": "UPI"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index d50317fec5e4..debfa9798ac0 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -6,7 +6,7 @@ GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
 GenuineIntel-6-(3D|47),v30,broadwell,core
 GenuineIntel-6-56,v12,broadwellde,core
 GenuineIntel-6-4F,v23,broadwellx,core
-GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
+GenuineIntel-6-55-[56789ABCDEF],v1.23,cascadelakex,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
 GenuineIntel-6-CF,v1.09,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
--
2.47.1.545.g3c1d2e2a6a-goog
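Side note (not from the perfmon source): a recurring change in this patch is that MetricThreshold strings lose their nested parentheses, e.g. "a > 0.2 & (b > 0.1 & (c > 0.2 & d > 0.2))" becomes "a > 0.2 & b > 0.1 & c > 0.2 & d > 0.2". Assuming "&" in the threshold grammar is an associative logical AND that binds looser than the comparisons, the grouping is redundant and the flattened form is equivalent. A minimal sketch of that equivalence, modeling "&" with Python's "and":

    import itertools

    def nested(a, b, c, d):
        return a > 0.2 and (b > 0.1 and (c > 0.2 and d > 0.2))

    def flat(a, b, c, d):
        return a > 0.2 and b > 0.1 and c > 0.2 and d > 0.2

    # AND is associative, so both forms agree on every input.
    for vals in itertools.product([0.0, 0.15, 0.25, 0.5], repeat=4):
        assert nested(*vals) == flat(*vals)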
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:45 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-9-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 08/22] perf vendor events: Update EmeraldRapids events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.09 to v1.11. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.11:
https://github.com/intel/perfmon/commit/bffcec00a184bb93d505f182047cf889d124fbd5
https://github.com/intel/perfmon/commit/a63da6de48046c365ab91c5001bfd5d907d5a1d6

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/emeraldrapids/cache.json         |   28 +-
 .../arch/x86/emeraldrapids/emr-metrics.json   | 1036 +++++++++--------
 .../arch/x86/emeraldrapids/frontend.json      |   19 -
 .../arch/x86/emeraldrapids/memory.json        |   15 +-
 .../arch/x86/emeraldrapids/metricgroups.json  |   10 +-
 .../arch/x86/emeraldrapids/pipeline.json      |   23 -
 .../arch/x86/emeraldrapids/uncore-io.json     |  218 ++--
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 8 files changed, 626 insertions(+), 725 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
index 21d5d96b8a6d..3b0581151d63 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
@@ -92,11 +92,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0x26",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -311,7 +311,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "1000003",
         "UMask": "0x81"
@@ -322,7 +321,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "1000003",
         "UMask": "0x82"
@@ -333,7 +331,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired memory instructions - loads and stores.",
         "SampleAfterValue": "1000003",
         "UMask": "0x83"
@@ -344,7 +341,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with locked access.",
         "SampleAfterValue": "100007",
         "UMask": "0x21"
@@ -355,7 +351,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x41"
@@ -366,7 +361,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x42"
@@ -377,7 +371,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x11"
@@ -388,7 +381,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
-        "PEBS": "1",
         "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x12"
@@ -408,7 +400,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4"
@@ -419,7 +410,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -430,7 +420,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
         "SampleAfterValue": "100003",
         "UMask": "0x8"
@@ -441,7 +430,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2"
@@ -452,7 +440,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
-        "PEBS": "1",
         "PublicDescription": "Retired load instructions which data sources missed L3 but serviced from local DRAM.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -463,7 +450,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x2"
     },
@@ -473,7 +459,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD",
-        "PEBS": "1",
         "PublicDescription": "Retired load instructions whose data sources was forwarded from a remote cache.",
         "SampleAfterValue": "100007",
         "UMask": "0x8"
@@ -484,7 +469,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x4"
     },
@@ -494,7 +478,6 @@
         "Data_LA": "1",
         "EventCode": "0xd4",
         "EventName": "MEM_LOAD_MISC_RETIRED.UC",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).",
         "SampleAfterValue": "100007",
         "UMask": "0x4"
@@ -505,7 +488,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.FB_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -516,7 +498,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
@@ -527,7 +508,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
         "SampleAfterValue": "200003",
         "UMask": "0x8"
@@ -538,7 +518,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
         "SampleAfterValue": "200003",
         "UMask": "0x2"
@@ -549,7 +528,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
         "SampleAfterValue": "100021",
         "UMask": "0x10"
@@ -560,7 +538,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
         "SampleAfterValue": "100021",
         "UMask": "0x4"
@@ -571,7 +548,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
         "SampleAfterValue": "50021",
         "UMask": "0x20"
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json
index ee288099a8d3..73bc185b39da 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json
@@ -34,7 +34,7 @@
         "MetricName": "UNCORE_FREQ"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
         "ScaleUnit": "1per_instr"
@@ -55,85 +55,73 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions",
         "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_store_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
-        "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e6 / duration_time",
-        "MetricName": "iio_bandwidth_read",
-        "ScaleUnit": "1MB/s"
-    },
-    {
-        "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
-        "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1e6 / duration_time",
-        "MetricName": "iio_bandwidth_write",
-        "ScaleUnit": "1MB/s"
-    },
-    {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read_local",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read_remote",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket",
        "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write_local",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write_remote",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Percentage of inbound full cacheline writes initiated by end device controllers that miss the L3 cache.",
+        "BriefDescription": "Percentage of inbound full cacheline writes initiated by end device controllers that miss the L3 cache",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM / UNC_CHA_TOR_INSERTS.IO_ITOM",
         "MetricName": "io_percent_of_inbound_full_writes_that_miss_l3",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Percentage of inbound partial cacheline writes initiated by end device controllers that miss the L3 cache.",
+        "BriefDescription": "Percentage of inbound partial cacheline writes initiated by end device controllers that miss the L3 cache",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_MISS_RFO) / (UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_RFO)",
         "MetricName": "io_percent_of_inbound_partial_writes_that_miss_l3",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Percentage of inbound reads initiated by end device controllers that miss the L3 cache.",
+        "BriefDescription": "Percentage of inbound reads initiated by end device controllers that miss the L3 cache",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_PCIRDCUR / UNC_CHA_TOR_INSERTS.IO_PCIRDCUR",
         "MetricName": "io_percent_of_inbound_reads_that_miss_l3",
         "ScaleUnit": "100%"
@@ -142,14 +130,14 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "itlb_2nd_level_large_page_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "itlb_2nd_level_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -230,36 +218,6 @@
         "MetricName": "llc_demand_data_read_miss_to_dram_latency",
         "ScaleUnit": "1ns"
     },
-    {
-        "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to Intel(R) Optane(TM) Persistent Memory(PMEM) in nano seconds",
-        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time",
-        "MetricName": "llc_demand_data_read_miss_to_pmem_latency",
-        "ScaleUnit": "1ns"
-    },
-    {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.",
-        "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time",
-        "MetricName": "llc_miss_local_memory_bandwidth_read",
-        "ScaleUnit": "1MB/s"
-    },
-    {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.",
-        "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time",
-        "MetricName": "llc_miss_local_memory_bandwidth_write",
-        "ScaleUnit": "1MB/s"
-    },
-    {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.",
-        "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time",
-        "MetricName": "llc_miss_remote_memory_bandwidth_read",
-        "ScaleUnit": "1MB/s"
-    },
-    {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.",
-        "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time",
-        "MetricName": "llc_miss_remote_memory_bandwidth_write",
-        "ScaleUnit": "1MB/s"
-    },
     {
         "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions",
         "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY",
@@ -285,19 +243,19 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory write bandwidth (MB/sec) caused by directory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM).",
+        "BriefDescription": "Memory write bandwidth (MB/sec) caused by directory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM)",
         "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + UNC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time",
         "MetricName": "memory_extra_write_bw_due_to_directory_updates",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
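Side note (not part of the patch): the io_bandwidth_* metrics above all share the same shape: one TOR insert per 64-byte cache line, scaled to MB over the measurement interval. A worked sketch of io_bandwidth_read with invented counts:

    # UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time
    tor_inserts = 250_000_000  # each insert moves one 64-byte cache line
    duration_time = 2.0        # seconds of measurement

    mb_per_s = tor_inserts * 64 / 1e6 / duration_time
    print(f"io_bandwidth_read = {mb_per_s:.0f} MB/s")  # 8000 MB/s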
"BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -320,24 +278,6 @@ "MetricName": "percent_uops_delivered_from_microcode_sequencer", "ScaleUnit": "100%" }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 6= 4 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" - }, { "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", @@ -360,7 +300,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -372,7 +312,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -380,12 +320,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -394,35 +334,125 @@ }, { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_inf= o_thread_slots", - "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", - "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. 
For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_c= ore_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_a= mx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization = / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utili= zation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_m= icrocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_b= ranch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes = + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_inf= o_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_oper= ation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tm= a_info_thread_slots", - "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredict= ions, tma_mispredicts_resteers", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. 
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -430,24 +460,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -455,8 +485,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 
0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%" }, { @@ -464,45 +494,92 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the 
Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_= L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMA= ND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD= ))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_= MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_= info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_i= nfo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * tma_info_system_core_fre= quency - 4.4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XS= NP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. 
Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_gro= up;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L= 3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT= / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD= _L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WIT= H_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -511,17 +588,17 @@ "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "( MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_= clks )", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -530,7 +607,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -538,75 +615,73 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMOR= Y_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEM= ORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "81 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEM= AND_RFO.L3_MISS\\,offcore_rsp\\=3D0x103b800002@ + 81 * tma_info_system_core= _frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_gr= oup;tma_frontend_bound_group;tma_issueFB", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. 
Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", - "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. 
This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -615,7 +690,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -624,7 +699,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -632,8 +715,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -641,8 +724,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -650,8 +733,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -659,8 +742,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -668,39 +751,37 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
-        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
-        "DefaultMetricgroupName": "TopdownL2",
-        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
     {
@@ -708,40 +789,40 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -755,7 +836,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
@@ -769,15 +850,15 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -785,104 +866,10 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "tma_info_bottleneck_big_code",
-        "MetricThreshold": "tma_info_bottleneck_big_code > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
-        "MetricGroup": "BvBO;Ret",
-        "MetricName": "tma_info_bottleneck_branching_overhead",
-        "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5",
-        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * ( ( tma_memory_bound * ( tma_dram_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_mem_bandwidth / ( tma_mem_bandwidth + tma_mem_latency ) ) ) + ( tma_memory_bound * ( tma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_sq_full / ( tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full ) ) ) + ( tma_memory_bound * ( tma_l1_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_fb_full / ( tma_dtlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_fb_full ) ) ) )",
-        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
-        "MetricName": "tma_info_bottleneck_cache_memory_bandwidth",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * ( ( tma_memory_bound * ( tma_dram_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_mem_latency / ( tma_mem_bandwidth + tma_mem_latency ) ) ) + ( tma_memory_bound * ( tma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_l3_hit_latency / ( tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full ) ) ) + ( tma_memory_bound * tma_l2_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) + ( tma_memory_bound * ( tma_store_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_store_latency / ( tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store ) ) ) + ( tma_memory_bound * ( tma_l1_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_l1_hit_latency / ( tma_dtlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_fb_full ) ) ) )",
-        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
-        "MetricName": "tma_info_bottleneck_cache_memory_latency",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
-        "MetricGroup": "BvCB;Cor;tma_issueComp",
-        "MetricName": "tma_info_bottleneck_compute_bound_est",
-        "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
-        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_big_code",
-        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
-        "MetricName": "tma_info_bottleneck_instruction_fetch_bw",
-        "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
-        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cpu@RS.EMPTY\\,umask\\=1@ / tma_info_thread_clks * tma_ports_utilized_0) / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
-        "MetricName": "tma_info_bottleneck_irregular_overhead",
-        "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10",
-        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricExpr": "100 * ( tma_memory_bound * ( tma_l1_bound / max( tma_memory_bound , ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) ) * ( tma_dtlb_load / max( tma_l1_bound , ( tma_dtlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_fb_full ) ) ) + ( tma_memory_bound * ( tma_store_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_dtlb_store / ( tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store ) ) ) )",
-        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_data_tlbs",
-        "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20",
-        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_synchronization"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
-        "MetricExpr": "100 * ( tma_memory_bound * ( ( tma_dram_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_mem_latency / ( tma_mem_bandwidth + tma_mem_latency ) ) * tma_remote_cache / ( tma_local_mem + tma_remote_mem + tma_remote_cache ) + ( tma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_contested_accesses + tma_data_sharing ) / ( tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full ) + ( tma_store_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * tma_false_sharing / ( ( tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store ) - tma_store_latency ) ) + tma_machine_clears * ( 1 - tma_other_nukes / ( tma_other_nukes ) ) )",
-        "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_synchronization",
-        "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 10",
-        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
-        "MetricName": "tma_info_bottleneck_mispredictions",
-        "MetricThreshold": "tma_info_bottleneck_mispredictions > 20",
-        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
-        "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)",
-        "MetricGroup": "BvOB;Cor;Offcore",
-        "MetricName": "tma_info_bottleneck_other_bottlenecks",
-        "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20",
-        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls."
-    },
-    {
-        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.",
-        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "BvUW;Ret",
-        "MetricName": "tma_info_bottleneck_useful_work",
-        "MetricThreshold": "tma_info_bottleneck_useful_work > 20"
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
@@ -943,11 +930,11 @@
         "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -960,20 +947,20 @@
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
-        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask\\=1\\,edge@",
+        "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency"
     },
@@ -1002,15 +989,21 @@
         "MetricGroup": "IcMiss",
         "MetricName": "tma_info_frontend_l2mpki_code_all"
     },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
     {
         "BriefDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection",
-        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=1\\,edge@",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "Fed",
         "MetricName": "tma_info_frontend_unknown_branch_cost",
-        "PublicDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection. See Unknown_Branches node."
+        "PublicDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection. See Unknown_Branches node"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
@@ -1028,7 +1021,7 @@
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -1036,7 +1029,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -1044,7 +1037,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
    },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
@@ -1052,7 +1045,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx512",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -1060,7 +1053,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Precision instruction (lower number means higher occurrence rate)",
@@ -1068,7 +1061,7 @@
         "MetricGroup": "Flops;FpScalar;InsType;Server",
         "MetricName": "tma_info_inst_mix_iparith_scalar_hp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Half-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Half-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
@@ -1076,7 +1069,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
@@ -1121,7 +1114,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100"
@@ -1131,7 +1124,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 13",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -1178,7 +1171,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -1196,7 +1189,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -1238,13 +1231,13 @@
     },
     {
         "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW;Offcore",
         "MetricName": "tma_info_memory_l3_cache_access_bw"
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -1263,12 +1256,12 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
     {
         "BriefDescription": "Average Parallel L2 cache miss demand Loads",
-        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
         "MetricGroup": "Memory_BW;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_mlp"
     },
@@ -1321,26 +1314,33 @@
         "MetricName": "tma_info_memory_mlp",
         "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)"
     },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15"
+    },
     {
         "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) covering for memory attached to local- and remote-socket",
-        "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / duration_time",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;Mem;MemoryBW;SoC",
         "MetricName": "tma_info_memory_soc_r2c_dram_bw",
-        "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) covering for memory attached to local- and remote-socket. See R2C_Offcore_BW."
+        "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) covering for memory attached to local- and remote-socket. See R2C_Offcore_BW"
     },
     {
         "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R2C)",
-        "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;Mem;MemoryBW;SoC",
         "MetricName": "tma_info_memory_soc_r2c_l3m_bw",
-        "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (R2C). This covering going to DRAM or other memory off-chip memory tears. See R2C_Offcore_BW."
+        "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (R2C). This covering going to DRAM or other memory off-chip memory tears. See R2C_Offcore_BW"
     },
     {
         "BriefDescription": "Average Off-core access BW for Reads-to-Core (R2C)",
-        "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / duration_time",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;Mem;MemoryBW;SoC",
         "MetricName": "tma_info_memory_soc_r2c_offcore_bw",
-        "PublicDescription": "Average Off-core access BW for Reads-to-Core (R2C). R2C account for demand or prefetch load/RFO/code access that fill data into the Core caches."
+        "PublicDescription": "Average Off-core access BW for Reads-to-Core (R2C). R2C account for demand or prefetch load/RFO/code access that fill data into the Core caches"
     },
     {
         "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
@@ -1369,7 +1369,7 @@
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@)",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
     },
@@ -1390,18 +1390,18 @@
         "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
         "MetricGroup": "MicroSeq;Pipeline;Ret;Retire",
         "MetricName": "tma_info_pipeline_ipassist",
-        "MetricThreshold": "tma_info_pipeline_ipassist < 100e3",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
         "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "Pipeline;Ret",
         "MetricName": "tma_info_pipeline_retire"
     },
     {
         "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
-        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "MicroSeq;Pipeline;Ret",
         "MetricName": "tma_info_pipeline_strings_cycles",
         "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1"
@@ -1415,7 +1415,7 @@
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -1433,28 +1433,28 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
-        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full"
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
     },
     {
         "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]",
-        "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / duration_time",
+        "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_info_system_time",
         "MetricGroup": "IoBW;MemOffcore;Server;SoC",
         "MetricName": "tma_info_system_io_read_bw",
         "PublicDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU"
     },
     {
         "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e9 / duration_time",
+        "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time",
         "MetricGroup": "IoBW;MemOffcore;Server;SoC",
         "MetricName": "tma_info_system_io_write_bw",
         "PublicDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]. Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU"
@@ -1464,13 +1464,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -1481,44 +1482,45 @@
     },
     {
         "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]",
-        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=0x1@",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=0x0@",
         "MetricGroup": "MemOffcore;MemoryLat;Server;SoC",
         "MetricName": "tma_info_system_mem_dram_read_latency",
         "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches"
     },
+    {
+        "BriefDescription": "Fraction of Uncore cycles where requests got rejected due to duplicate address already in IRQ ingress queue in the cache homing agent",
+        "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTICKS",
+        "MetricGroup": "LockCont;MemOffcore;Server;SoC",
+        "MetricName": "tma_info_system_mem_irq_duplicate_address",
+        "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0.1)"
+    },
     {
         "BriefDescription": "Average number of parallel data read requests to external memory",
-        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@",
+        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD\\,thresh\\=0x1@",
         "MetricGroup": "Mem;MemoryBW;SoC",
         "MetricName": "tma_info_system_mem_parallel_reads",
         "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
     },
-    {
-        "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]",
-        "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=0x1@ if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryLat;Server;SoC",
-        "MetricName": "tma_info_system_mem_pmm_read_latency",
-        "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches"
-    },
     {
         "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_time)",
         "MetricGroup": "Mem;MemoryLat;SoC",
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_read_bw"
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_write_bw"
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -1528,10 +1530,17 @@
     },
     {
         "BriefDescription": "Socket actual clocks when any core is active on that socket",
-        "MetricExpr": "uncore_cha_0@event\\=0x1@",
+        "MetricExpr": "cha_0@event\\=0x0@",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -1540,7 +1549,7 @@
     },
     {
         "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
-        "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_uncore_frequency"
     },
@@ -1551,7 +1560,7 @@
         "MetricName": "tma_info_system_upi_data_transmit_bw"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1560,14 +1569,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1577,13 +1587,13 @@
     },
     {
         "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
-        "MetricExpr": "TOPDOWN.SLOTS",
+        "MetricExpr": "slots",
         "MetricGroup": "TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots"
    },
     {
         "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
-        "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
         "MetricGroup": "SMT;TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots_utilization"
     },
@@ -1599,7 +1609,15 @@
         "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 9"
+        "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
@@ -1607,7 +1625,7 @@
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_int_operations",
         "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
         "ScaleUnit": "100%"
     },
     {
@@ -1615,8 +1633,8 @@
         "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_128b",
-        "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1624,8 +1642,8 @@
         "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_256b",
-        "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1633,26 +1651,26 @@
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1660,8 +1678,17 @@
         "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1669,17 +1696,17 @@
         "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1687,19 +1714,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "DefaultMetricgroupName": "TopdownL2", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1716,7 +1742,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1724,53 +1750,76 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + 
DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", - "MetricGroup": "BadSpec;BvMS;Default;MachineClears;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", + "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1778,32 +1827,31 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", - "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1816,7 +1864,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations 
> 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%" }, { @@ -1824,8 +1872,8 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1838,21 +1886,29 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). 
Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (UO= PS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%" }, { @@ -1861,7 +1917,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1869,7 +1925,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1883,19 +1939,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1904,7 +1960,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": 
"tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%" }, { @@ -1913,7 +1969,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%" }, { @@ -1922,7 +1978,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1931,25 +1987,25 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,um= ask\\=3D0xc@)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.= STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL = + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info= _thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(cpu@RS.EMPTY\= \,umask\\=3D1@ - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (= CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_threa= d_clks", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1957,7 +2013,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1967,8 +2023,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%" }, { @@ -1977,36 +2033,35 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD= _L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * = tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", - "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. 
For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" }, @@ -2015,7 +2070,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -2024,8 +2079,8 @@ "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", "ScaleUnit": "100%" }, { @@ -2034,7 +2089,7 @@ "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, @@ -2043,8 +2098,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -2052,17 +2107,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -2070,8 +2125,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. 
This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -2079,17 +2134,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2106,7 +2161,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -2114,7 +2169,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -2122,7 +2201,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - 
"MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2131,7 +2210,7 @@ "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -2140,29 +2219,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of cycles in aborted transactions.= ", - "MetricExpr": "(max(cycles\\-t - cycles\\-ct, 0) / cycles if has_e= vent(cycles\\-t) else 0)", - "MetricGroup": "transaction", - "MetricName": "tsx_aborted_cycles", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Number of cycles within a transaction divided= by the number of transactions.", - "MetricExpr": "(cycles\\-t / tx\\-start if has_event(cycles\\-t) e= lse 0)", - "MetricGroup": "transaction", - "MetricName": "tsx_cycles_per_transaction", - "ScaleUnit": "1cycles / transaction" - }, - { - "BriefDescription": "Percentage of cycles within a transaction reg= ion.", - "MetricExpr": "(cycles\\-t / cycles if has_event(cycles\\-t) else = 0)", - "MetricGroup": "transaction", - "MetricName": "tsx_transactional_cycles", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%" }, { @@ -2171,12 +2229,6 @@ "MetricName": "uncore_frequency", "ScaleUnit": "1GHz" }, - { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data r= eceive bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_RxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", - "MetricName": "upi_data_receive_bw", - "ScaleUnit": "1MB/s" - }, { "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data t= ransmit bandwidth (MB/sec)", "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json b/t= ools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json index f6e3e40a3b20..bf68493d4509 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json @@ -41,7 +41,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -53,7 +52,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -65,7 +63,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -77,7 +74,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -89,7 +85,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -101,7 +96,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -113,7 +107,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -125,7 +118,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -137,7 +129,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -149,7 +140,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -161,7 +151,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -173,7 +162,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -185,7 +173,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -197,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -209,7 +195,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -221,7 +206,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -233,7 +217,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -244,7 +227,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -256,7 +238,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json b/too= ls/perf/pmu-events/arch/x86/emeraldrapids/memory.json index 2ea19539291b..41d4120d4dae 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json @@ -63,7 +63,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1" @@ -76,7 +75,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -89,7 +87,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -102,7 +99,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -115,7 +111,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +123,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -141,7 +135,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. 
Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -154,7 +147,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -167,7 +159,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -178,7 +169,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least= 1 store operation. This PEBS event is the precisely-distributed (PDist) tr= igger covering all stores uops for sampling by the PEBS Store Latency Facil= ity. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2" @@ -305,17 +295,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. 
interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json= b/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json index e1de6c2675c4..9129fb7b7ce4 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json @@ -40,6 +40,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -86,7 +87,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -96,6 +99,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", @@ -116,10 +120,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth = category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", @@ -133,5 +140,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics 
contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json b/t= ools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json index 5d5811f26151..50cacfbbc7cf 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json @@ -62,7 +62,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -71,7 +70,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -81,7 +79,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -91,7 +88,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -101,7 +97,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -111,7 +106,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -121,7 +115,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -131,7 +124,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -141,7 +133,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -151,7 +142,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. 
When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "400009" }, @@ -160,7 +150,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -170,7 +159,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -180,7 +168,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -190,7 +177,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -200,7 +186,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2" @@ -210,7 +195,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -220,7 +204,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -469,7 +452,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -479,7 +461,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -488,7 +469,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, @@ -497,7 +477,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instruc= tions", "SampleAfterValue": "2000003", "UMask": "0x2" @@ -506,7 +485,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -516,7 +494,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. Note the number of ite= rations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json b/= tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json index 91013ced74aa..94340dee1c9c 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json @@ -79,86 +79,6 @@ "UMask": "0x27", "Unit": "iio_free_running" }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "9", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART0_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x30", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "10", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART1_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x31", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "11", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART2_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x32", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "12", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART3_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x33", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "13", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART4_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x34", - 
"Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "14", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART5_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x35", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "15", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART6_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x36", - "Unit": "iio_free_running" - }, - { - "BriefDescription": "Free running counter that increments for ever= y 32 bytes of data sent from the IO agent to the SOC", - "Counter": "16", - "EventCode": "0xff", - "EventName": "UNC_IIO_BANDWIDTH_OUT.PART7_FREERUN", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x37", - "Unit": "iio_free_running" - }, { "BriefDescription": "IIO Clockticks", "Counter": "0,1,2,3", @@ -201,7 +121,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -214,7 +134,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 1", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -227,7 +147,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 2", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -240,7 +160,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 3", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -253,7 +173,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 4", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -266,7 +186,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 5", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -279,7 +199,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 6", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -292,7 +212,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 7", - 
"UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -316,7 +236,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x16 card plugged in to stack, Or x8 card plu= gged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7000001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -329,7 +249,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 1", - "UMask": "0x7000002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -342,7 +262,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card i= s plugged in to slot 1", - "UMask": "0x7000004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -355,7 +275,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -368,7 +288,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x16 card plugged in to stack, Or x8 card plu= gged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7000010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -381,7 +301,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 1", - "UMask": "0x7000020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -394,7 +314,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card i= s plugged in to slot 1", - "UMask": "0x7000040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -407,7 +327,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -431,7 +351,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 c= ard is plugged in to slot 0", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -443,7 +363,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 1", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -455,7 +375,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -467,7 +387,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 3", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -479,7 +399,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. 
= : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 ca= rd is plugged in to slot 0", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -491,7 +411,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 1", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -503,7 +423,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +435,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 3", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -662,7 +582,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -675,7 +595,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -688,7 +608,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -701,7 +621,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -714,7 +634,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. 
: x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -727,7 +647,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -740,7 +660,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -753,7 +673,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -766,7 +686,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -779,7 +699,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -792,7 +712,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +725,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -818,7 +738,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. 
: x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -831,7 +751,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (different IIO stack) writing to this card. : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -844,7 +764,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (different IIO stack) writing to this card. : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -857,7 +777,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (different IIO stack) writing to this card. : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -974,7 +894,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff",
@@ -1082,7 +1001,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff",
@@ -1299,7 +1217,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Passing data to be written : How often different queues (e.g. channel / fc) ask to send request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, {
@@ -1351,7 +1269,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Request Ownership : How often different queues (e.g. channel / fc) ask to send request into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, {
@@ -1364,7 +1282,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Writing line : How often different queues (e.g. channel / fc) ask to send request into pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, {
@@ -1377,7 +1295,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Passing data to be written : How often different queues (e.g. channel / fc) are allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, {
@@ -1390,7 +1308,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issuing final read or write of line : How often different queues (e.g. channel / fc) are allowed to send request into pipeline", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, {
@@ -1403,7 +1321,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Processing response from IOMMU : How often different queues (e.g. channel / fc) are allowed to send request into pipeline", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -1416,7 +1334,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issuing to IOMMU : How often different queues (e.g. channel / fc) are allowed to send request into pipeline", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, {
@@ -1429,7 +1347,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Request Ownership : How often different queues (e.g. channel / fc) are allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, {
@@ -1442,7 +1360,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Writing line : How often different queues (e.g. channel / fc) are allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, {
@@ -1962,7 +1880,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : PCIe Request complete : Only for posted requests : Each PCIe request is broken down into a series of cacheline granular requests and each cacheline size request may need to make multiple passes through the pipeline (e.g. for posted interrupts or multi-cast). Each time a single PCIe request completes all its cacheline granular requests, it advances pointer.", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, {
@@ -1975,7 +1893,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Writing line : Only for posted requests : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, {
@@ -1988,7 +1906,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Issuing final read or write of line : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, {
@@ -2001,7 +1919,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Passing data to be written : Only for posted requests : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, {
@@ -2014,7 +1932,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Passing data to be written : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, {
@@ -2026,7 +1944,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2039,7 +1957,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Request Ownership : Only for posted requests", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, {
@@ -2052,7 +1970,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Writing line : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, {
@@ -2065,7 +1983,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Passing data to be written : Each PCIe request is broken down into a series of cacheline granular requests and each cacheline size request may need to make multiple passes through the pipeline (e.g. for posted interrupts or multi-cast). Each time a cacheline completes a single pass (e.g. posts a write to single multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, {
@@ -2091,7 +2009,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Request Ownership : Each PCIe request is broken down into a series of cacheline granular requests and each cacheline size request may need to make multiple passes through the pipeline (e.g. for posted interrupts or multi-cast). Each time a cacheline completes a single pass (e.g. posts a write to single multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, {
@@ -2104,7 +2022,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Writing line : Each PCIe request is broken down into a series of cacheline granular requests and each cacheline size request may need to make multiple passes through the pipeline (e.g. for posted interrupts or multi-cast). Each time a cacheline completes a single pass (e.g. posts a write to single multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, {
@@ -2309,7 +2227,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2322,7 +2240,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2335,7 +2253,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 2", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2348,7 +2266,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2361,7 +2279,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x16 card plugged in to Lane 4/5/6/7, Or x8 card plugged in to Lane 4/5, Or x4 card is plugged in to slot 4", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2374,7 +2292,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card.
: Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 5", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2387,7 +2305,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x8 card plugged in to Lane 6/7, Or x4 card is plugged in to slot 6", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, {
@@ -2400,7 +2318,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 7", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, {
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index debfa9798ac0..04a99e4767ef 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -8,7 +8,7 @@ GenuineIntel-6-56,v12,broadwellde,core
GenuineIntel-6-4F,v23,broadwellx,core
GenuineIntel-6-55-[56789ABCDEF],v1.23,cascadelakex,core
GenuineIntel-6-9[6C],v1.05,elkhartlake,core
-GenuineIntel-6-CF,v1.09,emeraldrapids,core
+GenuineIntel-6-CF,v1.11,emeraldrapids,core
GenuineIntel-6-5[CF],v13,goldmont,core
GenuineIntel-6-7A,v1.01,goldmontplus,core
GenuineIntel-6-B6,v1.03,grandridge,core
--
2.47.1.545.g3c1d2e2a6a-goog
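(Aside on the mapfile.csv hunk above: perf picks a vendor event directory by matching the running CPU's identifier string against the first CSV column, so the emeraldrapids bump from v1.09 to v1.11 takes effect automatically on matching parts. A rough Python sketch of that selection follows; treating the field as an anchored regex mirrors rows like GenuineIntel-6-55-[56789ABCDEF], but the authoritative matching logic lives in tools/perf, so this is an approximation and the host identifier string is hypothetical.)

import re

# Approximate sketch of mapfile.csv row selection; not the actual perf code.
row_cpuid, version, dirname, event_type = "GenuineIntel-6-CF", "v1.11", "emeraldrapids", "core"
host_cpuid = "GenuineIntel-6-CF"  # hypothetical Emerald Rapids identifier
if re.fullmatch(row_cpuid, host_cpuid):
    print(f"selecting {dirname} events, {version}")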
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:46 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-10-irogers@google.com>
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 09/22] perf vendor events: Update GrandRidge events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update events from v1.03 to v1.05. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.05:
https://github.com/intel/perfmon/commit/3b2e3528fbfb5576f443607ac9d772de88aed72c
https://github.com/intel/perfmon/commit/9bc1815536ff1f6fe73693a19a410b6a711740c2

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/grandridge/grr-metrics.json      | 284 ++++++------------
 .../arch/x86/grandridge/pipeline.json         |   3 +-
 .../arch/x86/grandridge/uncore-cache.json     |   4 +-
 .../x86/grandridge/uncore-interconnect.json   |  60 ++++
 .../arch/x86/grandridge/uncore-io.json        | 214 ++++++-------
 .../arch/x86/grandridge/uncore-memory.json    |   2 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 7 files changed, 260 insertions(+), 309 deletions(-)
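A note on the umask change: this update strips the port/FC selection bits that had been folded into the IIO event UMask values (e.g. 0x7001004 becoming 0x4), leaving port selection to the separate PortMask and FCMask fields. A minimal sketch of how that invariant could be checked against the updated JSON, assuming a checkout of this tree; the script is illustrative and not part of the patch:

import json

# Hypothetical sanity check: post-update, IIO event UMasks should fit in the
# low byte, with port/functional-class selection in PortMask/FCMask instead.
path = "tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json"
with open(path) as f:
    events = json.load(f)
stale = [e["EventName"] for e in events
         if e.get("Unit") == "IIO" and int(e.get("UMask", "0x0"), 16) > 0xff]
print("events still folding port bits into UMask:", stale or "none")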
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json b/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json index 07e542297e93..2f9959c61718 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json
@@ -1,4 +1,11 @@ [ + { + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" + }, { "BriefDescription": "C1 residency percent per core", "MetricExpr": "cstate_core@c1\\-residency@ / TSC",
@@ -7,17 +14,24 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per module", - "MetricExpr": "cstate_module@c6\\-residency@ / TSC", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", "MetricGroup": "Power", - "MetricName": "C6_Module_Residency", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, {
@@ -28,7 +42,21 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" },
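For illustration, the residency metrics added above are simple ratios of a C-state residency counter to TSC, scaled to percent; a worked example with made-up counts (not measured data):

# Illustrative arithmetic only: C10_Pkg_Residency is package C10 residency
# cycles over TSC cycles, reported as a percentage (ScaleUnit "100%").
c10_residency = 1_800_000_000  # made-up cstate_pkg c10-residency count
tsc = 2_400_000_000            # made-up TSC count over the same interval
print(f"C10_Pkg_Residency: {100 * c10_residency / tsc:.1f}%")  # -> 75.0%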
@@ -49,31 +77,31 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s"
@@ -82,14 +110,14 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, {
@@ -188,17 +216,15 @@ { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to certain allocation restrictions", "MetricExpr": "tma_core_bound", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls", "MetricExpr": "TOPDOWN_BE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count. If a uop is not available (IQ is empty), this event will not count", "ScaleUnit": "100%"
@@ -206,104 +232,92 @@ { "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear", "MetricExpr": "TOPDOWN_BAD_SPECULATION.ALL_P / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", + "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", - "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", + "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to branch mispredicts", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS).", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to the microcode sequencer (MS)", "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of cycles due to backend bound stalls that are bounded by core restrictions and not attributed to an outstanding load or stores, or resource limitation", "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls", "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that does not require the use of microcode, classified as a fast nuke, due to memory ordering, memory disambiguation and memory renaming", "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls.", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to frontend stalls", "MetricExpr": "TOPDOWN_FE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to instruction cache misses", "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend bandwidth restrictions due to decode, predecode, cisc, and other limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend latency restrictions due to icache misses, itlb misses, branch detection, and resteer limitations.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to frontend latency restrictions due to icache misses, itlb misses, branch detection, and resteer limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" },
@@ -331,32 +345,6 @@ "MetricGroup": "Flops", "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp" }, - { - "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss", - "MetricExpr": "tma_info_bottleneck_dtlb_miss_bound_cycles", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss", - "MetricExpr": "tma_info_bottleneck_ifetch_miss_bound_cycles", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss. See Info.Ifetch_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss", - "MetricExpr": "tma_info_bottleneck_load_miss_bound_cycles", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall", - "MetricExpr": "tma_info_bottleneck_mem_exec_bound_cycles", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound" - }, { "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss", "MetricExpr": "100 * (LD_HEAD.DTLB_MISS_AT_RET + LD_HEAD.PGWALK_AT_RET) / CPU_CLK_UNHALTED.CORE",
@@ -438,21 +426,6 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_unknown_branch_ratio" }, - { - "BriefDescription": "Percentage of time that allocation is stalled due to load buffer full", - "MetricExpr": "tma_info_buffer_stalls_load_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled due to memory reservation stations full", - "MetricExpr": "tma_info_buffer_stalls_mem_rsv_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled due to store buffer full", - "MetricExpr": "tma_info_buffer_stalls_store_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles" - }, { "BriefDescription": "Percentage of time that allocation is stalled due to load buffer full", "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.LD_BUF / CPU_CLK_UNHALTED.CORE",
@@ -474,7 +447,8 @@ { "BriefDescription": "Cycles Per Instruction", "MetricExpr": "CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY", - "MetricName": "tma_info_core_cpi" + "MetricName": "tma_info_core_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Floating Point Operations Per Cycle",
@@ -492,21 +466,6 @@ "MetricExpr": "TOPDOWN_RETIRING.ALL_P / INST_RETIRED.ANY", "MetricName": "tma_info_core_upi" }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where the ifetch miss hits in the L2", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l2hit", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l2hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where the ifetch miss hits in the L3", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3hit", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l3hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.LLC_MISS / MEM_BOUND_STALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l3miss" - }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where the ifetch miss hits in the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.L2_HIT / MEM_BOUND_STALLS_IFETCH.ALL",
@@ -519,24 +478,6 @@ "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3hit", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l2hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l3hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.LLC_MISS / MEM_BOUND_STALLS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3miss" - }, { "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.L2_HIT / MEM_BOUND_STALLS_LOAD.ALL",
@@ -584,16 +525,6 @@ "MetricExpr": "1e3 * MACHINE_CLEARS.SMC / INST_RETIRED.ANY", "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki" }, - { - "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_adressaliasing", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing" - }, - { - "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_storefwdblk", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk" - }, { "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block", "MetricExpr": "100 * LD_BLOCKS.ADDRESS_ALIAS / MEM_UOPS_RETIRED.ALL_LOADS",
@@ -606,31 +537,6 @@ "MetricName": "tma_info_mem_exec_blocks_loads_with_storefwdblk", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_l1miss", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_otherpipelineblks", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_pagewalk", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_stlbhit", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_storefwding", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding" - }, { "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss", "MetricExpr": "100 * LD_HEAD.L1_MISS_AT_RET / LD_HEAD.ANY_AT_RET",
@@ -686,11 +592,6 @@ "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / TOPDOWN_RETIRING.ALL_P", "MetricName": "tma_info_mem_mix_memload_ratio" }, - { - "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction", - "MetricExpr": "tma_info_serialization_tpause_cycles", - "MetricName": "tma_info_serialization _%_tpause_cycles" - }, { "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction", "MetricExpr": "100 * SERIALIZATION.C01_MS_SCB / (6 * CPU_CLK_UNHALTED.CORE)",
@@ -711,14 +612,17 @@ }, { "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE_P@k / CPU_CLK_UNHALTED.CORE", - "MetricGroup": "Summary", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P:k / CPU_CLK_UNHALTED.CORE", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux" + }, { "BriefDescription": "Average Frequency Utilization relative nominal frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization" }, {
@@ -742,102 +646,90 @@ "MetricName": "tma_info_uop_mix_x87_uop_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses", "MetricExpr": "TOPDOWN_FE_BOUND.ITLB_MISS / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)", "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized", "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes", "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)", "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)", "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation", "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that result in retirement slots", "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)", "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (6 * CPU_CLK_UNHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" },
{
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json index b67c0c89054d..40fa4f5ae261 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json
@@ -1,6 +1,6 @@ [ { - "BriefDescription": "Counts the number of cycles when any of the dividers are active.", + "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.", "Counter": "0,1,2,3,4,5,6,7", "CounterMask": "1", "EventCode": "0xcd",
@@ -185,7 +185,6 @@ "BriefDescription": "Fixed Counter: Counts the number of instructions retired", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x1" },
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json b/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json index 1eaf796601b1..6a80cf6cbd36 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json
@@ -573,7 +573,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Code read from local IA that miss the cache", + "BriefDescription": "Code read from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD",
@@ -593,7 +593,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt from local IA that miss the cache", + "BriefDescription": "Data read opt from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT",
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json index 6aaca4039107..2c18767511f3 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json
@@ -183,6 +183,26 @@ "PerPkg": "1", "Unit": "IRP" }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REJ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IRP" + }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "IRP" + }, { "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pulled away ownership before a write was committed", "Counter": "0,1,2,3",
@@ -193,6 +213,46 @@ "UMask": "0x10", "Unit": "IRP" }, + { + "BriefDescription": "Snoop Hit E/S responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x74", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit I responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coherent memory, received by the IRP resulting in write ownership requests issued by IRP to the mesh.", "Counter": "0,1,2,3",
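One observation on the new UNC_I_SNOOP_RESP.ALL_* encodings above: the four umasks (0x71, 0x72, 0x74, 0x78) share the 0x70 bits and differ in one low bit each, which suggests a common request-type mask ORed with a per-response-state bit. That reading is inferred from the values in this patch rather than from documentation:

# Inferred bit layout of the ALL_* snoop response umasks above; the
# interpretation of the shared 0x70 bits is an assumption, not documented.
umasks = {"ALL_MISS": 0x71, "ALL_HIT_I": 0x72, "ALL_HIT_ES": 0x74, "ALL_HIT_M": 0x78}
for name, mask in umasks.items():
    print(f"{name}: shared=0x{mask & 0x70:02x} state_bit=0x{mask & 0x0f:02x}")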
"UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json b/too= ls/perf/pmu-events/arch/x86/grandridge/uncore-io.json index 33fc7b835abf..c5b05c71c56d 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -232,7 +232,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": 
"0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -244,7 +244,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -256,7 +256,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -268,7 +268,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -280,7 +280,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -292,7 +292,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -304,7 +304,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -316,7 +316,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -328,7 +328,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -339,7 +339,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -350,7 +350,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -361,7 +361,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -372,7 +372,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -383,7 +383,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -394,7 +394,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -405,7 +405,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -416,7 +416,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -427,7 +427,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -438,7 +438,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -449,7 +449,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -460,7 +460,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -471,7 +471,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -482,7 +482,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -493,7 +493,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -504,7 +504,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +515,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -526,7 +526,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { 
@@ -537,7 +537,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -548,7 +548,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -559,7 +559,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -570,7 +570,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -581,7 +581,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -592,7 +592,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -603,7 +603,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -615,7 +615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -639,7 +639,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -651,7 +651,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -663,7 +663,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -675,7 +675,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -687,7 +687,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -699,7 +699,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -832,7 +832,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -844,7 +844,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -856,7 +856,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -868,7 +868,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -880,7 +880,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -892,7 +892,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -904,7 +904,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -925,7 +925,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -936,7 +936,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -947,7 +947,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -958,7 +958,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -969,7 +969,7 @@ "FCMask": "0x07", "PerPkg": 
"1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -980,7 +980,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -991,7 +991,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1002,7 +1002,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1013,7 +1013,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1024,7 +1024,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1035,7 +1035,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1046,7 +1046,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1057,7 +1057,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1068,7 +1068,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1079,7 +1079,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1090,7 +1090,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1101,7 +1101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1112,7 +1112,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1123,7 +1123,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1134,7 +1134,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1145,7 +1145,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1156,7 +1156,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1167,7 +1167,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1178,7 +1178,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1200,7 +1200,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1211,7 +1211,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1222,7 +1222,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1233,7 +1233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1244,7 +1244,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1255,7 +1255,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1266,7 +1266,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1278,7 +1278,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1290,7 +1290,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1302,7 +1302,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1314,7 +1314,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1326,7 +1326,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1338,7 +1338,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1350,7 +1350,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json index 7e6e6764f181..e75b3050ccd5 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. 
DCLK is 1/4 of DRAM data rate.",
         "Counter": "0,1,2,3",
         "EventCode": "0x01",
         "EventName": "UNC_M_CLOCKTICKS",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 04a99e4767ef..1a28f9d70e74 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -11,7 +11,7 @@ GenuineIntel-6-9[6C],v1.05,elkhartlake,core
 GenuineIntel-6-CF,v1.11,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
 GenuineIntel-6-7A,v1.01,goldmontplus,core
-GenuineIntel-6-B6,v1.03,grandridge,core
+GenuineIntel-6-B6,v1.05,grandridge,core
 GenuineIntel-6-A[DE],v1.02,graniterapids,core
 GenuineIntel-6-(3C|45|46),v35,haswell,core
 GenuineIntel-6-3F,v28,haswellx,core
-- 
2.47.1.545.g3c1d2e2a6a-goog

From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:47 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-11-irogers@google.com>
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 10/22] perf vendor events: Update/add Graniterapids events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.02 to v1.04. Add TMA metrics 5.01.
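To sanity-check the version bump, the graniterapids row of the mapfile
can be read back programmatically. A minimal sketch in Python (the path
assumes the working directory is the root of a kernel tree; with the
series applied the printed version should be v1.04, per the mapfile.csv
change in the diffstat below):

  #!/usr/bin/env python3
  # Print the event-file version that perf's mapfile.csv records for
  # graniterapids. Path is an assumption; adjust to the tree inspected.
  import csv

  MAPFILE = "tools/perf/pmu-events/arch/x86/mapfile.csv"

  with open(MAPFILE, newline="") as f:
      for row in csv.reader(f):
          # Rows look like: Family-model,Version,Filename,EventType
          if len(row) == 4 and row[2] == "graniterapids":
              print(f"{row[0]}: {row[2]} events at {row[1]}")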
Bring in the event updates v1.04:
https://github.com/intel/perfmon/commit/79b9e512eab58641941a0b8d10ffe75914a87e17
https://github.com/intel/perfmon/commit/bc74a895e461b5ac720559da667e83a8fedf7829
The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782
Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/graniterapids/cache.json         |  110 +-
 .../arch/x86/graniterapids/counter.json       |    2 +-
 .../arch/x86/graniterapids/frontend.json      |   24 +-
 .../arch/x86/graniterapids/gnr-metrics.json   | 2311 +++++++++++++++++
 .../arch/x86/graniterapids/memory.json        |  121 +-
 .../arch/x86/graniterapids/metricgroups.json  |  145 ++
 .../arch/x86/graniterapids/other.json         |   99 +
 .../arch/x86/graniterapids/pipeline.json      |   40 +-
 .../arch/x86/graniterapids/uncore-cache.json  |   14 +-
 .../graniterapids/uncore-interconnect.json    |   87 +
 .../arch/x86/graniterapids/uncore-io.json     |  280 +-
 .../arch/x86/graniterapids/uncore-memory.json |  122 +-
 .../arch/x86/graniterapids/uncore-power.json  |   98 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 14 files changed, 3212 insertions(+), 243 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json
 create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json

diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
index b56066274813..37cd575eee9e 100644
--- a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
+++ b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
@@ -83,11 +83,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0x26",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -320,7 +320,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions.
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -331,7 +330,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -342,7 +340,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -353,7 +350,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -364,7 +360,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -375,7 +370,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -386,7 +380,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x9" @@ -397,7 +390,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0xa" @@ -408,7 +400,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -419,7 +410,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -439,7 +429,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -450,7 +439,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -461,7 +449,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -472,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources 
were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -483,7 +469,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -494,7 +479,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x2" }, @@ -504,7 +488,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -515,7 +498,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -525,7 +507,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -536,7 +517,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -547,7 +527,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. 
This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -558,7 +537,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -569,7 +547,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -580,7 +557,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -591,7 +567,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -602,7 +577,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -664,6 +638,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that resulted in a s= noop that hit in another core, which did not forward the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C0001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that resulted in a s= noop hit in another core's caches which forwarded the unmodified data to th= e requesting core.", "Counter": "0,1,2,3", @@ -674,6 +658,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y a cache on a remote socket where a snoop hit a modified line in another c= ore's caches which forwarded the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_CACHE.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1030000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y a cache on a remote socket where a snoop hit in another core's caches whi= ch forwarded the unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_CACHE.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x830000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit in= the L3 or were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -704,6 +708,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that resulted in a snoop hit a modified line in another core's caches w= hich forwarded the 
data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10003C4477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop was sent= and data was returned (Modified or Not Modified).", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1830004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop hit a mo= dified line in another core's caches which forwarded the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1030004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop hit in a= nother core's caches which forwarded the unmodified data to the requesting = core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x830004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO), hard= ware prefetch RFOs (which bring data to L2), and software prefetches for ex= clusive ownership (PREFETCHW) that hit to a (M)odified cacheline in the L3 = or snoop filter.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.RFO_TO_CORE.L3_HIT_M", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1F80040022", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Any memory transaction that reached the SQ.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json b/to= ols/perf/pmu-events/arch/x86/graniterapids/counter.json index 250781a8ca64..4e95dd19aa19 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/counter.json @@ -52,7 +52,7 @@ { "Unit": "PCU", "CountersNumFixed": "0", - "CountersNumGeneric": 4 + "CountersNumGeneric": "4" }, { "Unit": "CHACMS", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json b/t= ools/perf/pmu-events/arch/x86/graniterapids/frontend.json index 663c1a0e55a2..dc81055941b1 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json @@ -41,7 +41,6 @@ "EventName": "FRONTEND_RETIRED.ANY_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", "SampleAfterValue": "100007", "UMask": "0x3" @@ -53,7 +52,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode 
stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -65,7 +63,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -77,7 +74,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -89,7 +85,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -101,7 +96,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -113,7 +107,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -125,7 +118,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -137,7 +129,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -149,7 +140,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -161,7 +151,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -173,7 +162,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. 
A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -185,7 +173,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -197,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -209,7 +195,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -221,7 +206,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -233,7 +217,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -244,8 +227,7 @@ "EventCode": "0xc6", "EventName": "FRONTEND_RETIRED.LATE_SWPF", "MSRIndex": "0x3F7", - "MSRValue": "0x9", - "PEBS": "1", + "MSRValue": "0xA", "PublicDescription": "Number of Instruction Cache demand miss in s= hadow of an on-going i-fetch cache-line triggered by PREFETCHIT0/1 instruct= ions", "SampleAfterValue": "100007", "UMask": "0x3" @@ -257,7 +239,6 @@ "EventName": "FRONTEND_RETIRED.MISP_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "ANT retired branches that got just mispredic= ted", "SampleAfterValue": "100007", "UMask": "0x2" @@ -269,7 +250,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -280,7 +260,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -292,7 +271,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3" }, diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json = b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json new file mode 100644 index 000000000000..2cccb7070d44 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json @@ -0,0 +1,2311 @@ +[ + { + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" + }, + { + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", + "MetricName": "cpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Percent of time that cores are in cstate C0 a= s observed by the power control unit (PCU)", + "MetricExpr": "100 * (UNC_P_POWER_STATE_OCCUPANCY_CORES_C0 / (pcu_= 0@UNC_P_CLOCKTICKS@ / #num_packages) / (#num_cores / #num_packages * #num_p= ackages))", + "MetricName": "cpu_cstate_c0" + }, + { + "BriefDescription": "Percent of time that cores are in cstate C6 a= s observed by the power control unit (PCU)", + "MetricExpr": "100 * (UNC_P_POWER_STATE_OCCUPANCY_CORES_C6 / (pcu_= 0@UNC_P_CLOCKTICKS@ / #num_packages) / (#num_cores / #num_packages * #num_p= ackages))", + "MetricName": "cpu_cstate_c6" + }, + { + "BriefDescription": "CPU operating frequency (in 
GHz)", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC = * #SYSTEM_TSC_FREQ / 1e9", + "MetricName": "cpu_operating_frequency", + "ScaleUnit": "1GHz" + }, + { + "BriefDescription": "Percentage of time spent in the active CPU po= wer state C0", + "MetricExpr": "tma_info_system_cpus_utilized", + "MetricName": "cpu_utilization", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricName": "dtlb_2nd_level_load_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", + "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricName": "dtlb_2nd_level_store_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO reads that are initiated by end device controlle= rs that are requesting memory from the CPU", + "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e= 6 / duration_time", + "MetricName": "iio_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU", + "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1= e6 / duration_time", + "MetricName": "iio_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", + "MetricName": "io_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", + "MetricName": "io_bandwidth_read_local", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", + "MetricName": "io_bandwidth_read_remote", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", + "MetricExpr": 
"(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write_local", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write_remote", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", + "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricName": "itlb_2nd_level_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read requests missing= in L1 instruction cache (includes prefetches) to the total number of compl= eted instructions", + "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", + "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of demand load requests hitti= ng in L1 data cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", + "MetricName": "l1d_demand_data_read_hits_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of requests missing L1 data c= ache (includes data+rfo w/ prefetches) to the total number of completed ins= tructions", + "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", + "MetricName": "l1d_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read request missing = L2 cache to the total number of completed instructions", + "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricName": "l2_demand_code_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed demand load requ= ests hitting in L2 cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", + "MetricName": "l2_demand_data_read_hits_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed data read reques= t missing L2 cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricName": "l2_demand_data_read_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of requests missing L2 cache = (includes code+data+rfo w/ prefetches) to the total number of completed ins= tructions", + "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", + "MetricName": "l2_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD / INST_RETIRED.ANY", + "MetricName": "llc_code_read_mpi_demand_plus_prefetch", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) 
to the total number = of completed instructions", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_= TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETI= RED.ANY", + "MetricName": "llc_data_read_mpi_demand_plus_prefetch", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) in nano seconds", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_= OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", + "MetricName": "llc_demand_data_read_miss_latency", + "ScaleUnit": "1ns" + }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to local memory in nano= seconds", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL / UN= C_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL) / (UNC_CHA_CLOCKTICKS / (source_count(= UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL) * #num_packages)) * duration_time", + "MetricName": "llc_demand_data_read_miss_latency_for_local_request= s", + "ScaleUnit": "1ns" + }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to remote memory in nan= o seconds", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE / U= NC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE) / (UNC_CHA_CLOCKTICKS / (source_coun= t(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE) * #num_packages)) * duration_ti= me", + "MetricName": "llc_demand_data_read_miss_latency_for_remote_reques= ts", + "ScaleUnit": "1ns" + }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to DRAM in nano seconds= ", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", + "MetricName": "llc_demand_data_read_miss_to_dram_latency", + "ScaleUnit": "1ns" + }, + { + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", + "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", + "MetricName": "llc_miss_local_memory_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", + "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", + "MetricName": "llc_miss_local_memory_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", + "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", + "MetricName": "llc_miss_remote_memory_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", + "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", + "MetricName": "llc_miss_remote_memory_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "The ratio of number of completed memory load = instructions to the total number completed instructions", + "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", + "MetricName": "loads_per_instr", + 
"ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "DDR memory read bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.RD + UNC_M_CAS_COUNT_SCH1.RD)= * 64 / 1e6 / duration_time", + "MetricName": "memory_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "DDR memory bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.RD + UNC_M_CAS_COUNT_SCH1.RD = + UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR) * 64 / 1e6 / duration_= time", + "MetricName": "memory_bandwidth_total", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "DDR memory write bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR)= * 64 / 1e6 / duration_time", + "MetricName": "memory_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", + "MetricName": "numa_reads_addressed_to_local_dram", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", + "MetricName": "numa_reads_addressed_to_remote_dram", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from decoded instruction cache= (decoded stream buffer or DSB) as a percent of total uops delivered to Ins= truction Decode Queue", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_decoded_icache", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from legacy decode pipeline (M= icro-instruction Translation Engine or MITE) as a percent of total uops del= ivered to Instruction Decode Queue", + "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ= .MS_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from microcode sequencer (MS) = as a percent of total uops delivered to Instruction Decode Queue", + "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.M= S_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_microcode_sequencer", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "The ratio of number of completed memory store= instructions to the total number completed instructions", + "MetricExpr": 
"MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", + "MetricName": "stores_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the Advanced Matrix eXtensions (AMX) execution engine was busy with tile = (arithmetic) operations", + "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", + "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", + "MetricName": "tma_amx_busy", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. 
Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_c= ore_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_a= mx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization = / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utili= zation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_m= icrocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_b= ranch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes = + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_inf= o_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_oper= ation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. 
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * FRONTEND_RETIRED.L1I_MISS:R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * FRONTEND_RETIRED.L2_MISS:R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instruction fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * FRONTEND_RETIRED.ITLB_MISS:R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * FRONTEND_RETIRED.STLB_MISS:R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
"MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * BR_MISP_RETIRED.= COND_NTAKEN_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * BR_MISP_RETIRED.C= OND_TAKEN_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_tk_mispredicts", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * MEM_LOAD_= L3_HIT_RETIRED.XSNP_MISS:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (79 * tma_i= nfo_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < M= EM_LOAD_L3_HIT_RETIRED.XSNP_MISS:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS *= (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequen= cy) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_= FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_core_freque= ncy) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED= .XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_c= ore_frequency) - 4.4 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_R= D.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DA= TA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOA= D_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were the bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were the bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
+        "MetricName": "tma_decoder0_alone",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivers higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * MEM_INST_RETIRED.STLB_HIT_LOADS:R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < MEM_INST_RETIRED.STLB_HIT_LOADS:R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
+        "MetricName": "tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * MEM_INST_RETIRED.STLB_HIT_STORES:R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < MEM_INST_RETIRED.STLB_HIT_STORES:R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
+        "MetricName": "tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
+        "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricName": "tma_false_sharing",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
+        "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricName": "tma_fb_full",
+        "MetricThreshold": "tma_fb_full > 0.3",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+        "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricName": "tma_fetch_bandwidth",
+        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_scalar",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_vector",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_128b",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_256b",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_512b",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
+        "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_frontend_bound",
+        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fused_instructions",
+        "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_heavy_operations",
+        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_icache_misses",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect CALL instructions",
+        "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRED.INDIRECT_CALL_COST:R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_call_mispredicts",
+        "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect JMP instructions",
+        "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * BR_MISP_RETIRED.INDIRECT_COST:R - BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRED.INDIRECT_CALL_COST:R) / tma_info_thread_clks, 0)",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_jump_mispredicts",
+        "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
+        "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_indirect",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_ret",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500"
+    },
+    {
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmispredict",
+        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
+    },
+    {
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
+        "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
+        "MetricGroup": "BrMispredicts",
+        "MetricName": "tma_info_bad_spec_spec_clears_ratio"
+    },
+    {
+        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
+        "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t_utilization > 0.5 else 0)",
+        "MetricGroup": "Cor;SMT",
+        "MetricName": "tma_info_botlnk_l0_core_bound_likely",
+        "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are CALL or RET",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_callret"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are non-taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_nt"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_tk"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_jump"
+    },
+    {
+        "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)",
+        "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_cond_tk + tma_info_branches_callret + tma_info_branches_jump)",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_other_branches"
+    },
+    {
+        "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
+        "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_info_thread_clks)",
+        "MetricGroup": "SMT",
+        "MetricName": "tma_info_core_core_clks"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks",
+        "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group",
+        "MetricName": "tma_info_core_coreipc"
+    },
+    {
+        "BriefDescription": "uops Executed per Cycle",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks",
+        "MetricGroup": "Power",
"tma_info_core_epc" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_= INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=3D0x18@ + 8 * cpu@FP_ARITH_INST_R= ETIRED.256B_PACKED_SINGLE\\,umask\\=3D0x60@ + 16 * FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
+        "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp"
+    },
+    {
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
+        "MetricGroup": "DSBmiss",
+        "MetricName": "tma_info_frontend_dsb_switch_cost"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired DSB misses",
+        "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * FRONTEND_RETIRED.ANY_DSB_MISS:R / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;Fed;FetchLat",
+        "MetricName": "tma_info_frontend_dsb_switches_ret",
+        "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05"
+    },
+    {
+        "BriefDescription": "Average number of Uops issued by front-end when it issued something",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
+        "MetricGroup": "Fed;FetchBW",
+        "MetricName": "tma_info_frontend_fetch_upc"
+    },
+    {
+        "BriefDescription": "Average Latency for L1 instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS",
+        "MetricGroup": "Fed;FetchLat;IcMiss",
+        "MetricName": "tma_info_frontend_icache_miss_latency"
+    },
+    {
+        "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS",
+        "MetricGroup": "DSBmiss;Fed",
+        "MetricName": "tma_info_frontend_ipdsb_miss_ret",
+        "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50"
+    },
+    {
+        "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY",
+        "MetricGroup": "Fed",
+        "MetricName": "tma_info_frontend_ipunknown_branch"
+    },
+    {
+        "BriefDescription": "L2 cache true code cacheline misses per kilo instruction",
+        "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "IcMiss",
+        "MetricName": "tma_info_frontend_l2mpki_code"
+    },
+    {
+        "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction",
+        "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "IcMiss",
+        "MetricName": "tma_info_frontend_l2mpki_code_all"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired operations that invoke the Microcode Sequencer",
+        "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * FRONTEND_RETIRED.MS_FLOWS:R / tma_info_thread_clks",
+        "MetricGroup": "Fed;FetchLat;MicroSeq",
+        "MetricName": "tma_info_frontend_ms_latency_ret",
+        "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05"
+    },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
+        "MetricGroup": "Fed",
+        "MetricName": "tma_info_frontend_unknown_branch_cost",
+    {
+        "BriefDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
+        "MetricGroup": "Fed",
+        "MetricName": "tma_info_frontend_unknown_branch_cost",
+        "PublicDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection. See Unknown_Branches node"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired branches that got branch address clears",
+        "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * FRONTEND_RETIRED.UNKNOWN_BRANCH:R / tma_info_thread_clks",
+        "MetricGroup": "Fed;FetchLat",
+        "MetricName": "tma_info_frontend_unknown_branches_ret"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
+        "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;PGO",
+        "MetricName": "tma_info_inst_mix_bptkbranch"
+    },
+    {
+        "BriefDescription": "Total number of retired Instructions",
+        "MetricExpr": "INST_RETIRED.ANY",
+        "MetricGroup": "Summary;TmaL1;tma_L1_group",
+        "MetricName": "tma_info_inst_mix_instructions",
+        "PublicDescription": "Total number of retired Instructions. Sample with: INST_RETIRED.PREC_DIST"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR + (FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR))",
+        "MetricGroup": "Flops;InsType",
+        "MetricName": "tma_info_inst_mix_iparith",
+        "MetricThreshold": "tma_info_inst_mix_iparith < 10",
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF)",
+        "MetricGroup": "Flops;FpVector;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_avx128",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF)",
+        "MetricGroup": "Flops;FpVector;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_avx256",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
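All of the tma_info_inst_mix_iparith* metrics that follow invert an occurrence rate, so a lower value means the instruction class is more frequent. A sketch of the convention (values are illustrative):

  def instructions_per(inst_retired, event_count):
      # "Instructions per X": a LOWER result means X occurs MORE often.
      return inst_retired / event_count

  # With the < 10 thresholds above, 1e9 instructions and 2e8 FP-arith
  # instructions give 5.0, flagging an FP-heavy instruction mix.
  print(instructions_per(1e9, 2e8))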
+    {
+        "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF)",
+        "MetricGroup": "Flops;FpVector;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_avx512",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10",
+        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "MetricGroup": "Flops;FpScalar;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED2.SCALAR",
+        "MetricGroup": "Flops;FpScalar;InsType;Server",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_hp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Half-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "MetricGroup": "Flops;FpScalar;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
+    },
+    {
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Branches;Fed;InsType",
+        "MetricName": "tma_info_inst_mix_ipbranch",
+        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8"
+    },
+    {
+        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
+        "MetricGroup": "Branches;Fed;PGO",
+        "MetricName": "tma_info_inst_mix_ipcall",
+        "MetricThreshold": "tma_info_inst_mix_ipcall < 200"
+    },
+    {
+        "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)",
+        "MetricGroup": "Flops;InsType",
+        "MetricName": "tma_info_inst_mix_ipflop",
+        "MetricThreshold": "tma_info_inst_mix_ipflop < 10"
+    },
+    {
+        "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS",
+        "MetricGroup": "InsType",
+        "MetricName": "tma_info_inst_mix_ipload",
+        "MetricThreshold": "tma_info_inst_mix_ipload < 3"
+    },
"Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t" + }, + { + "BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_inst_mix_i= nstructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_memory_core_l2_evictions_nonsilent_pki" + }, + { + "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_inst_mix_instr= uctions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_memory_core_l2_evictions_silent_pki" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + 
"MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": 
"LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki" + }, + { + "BriefDescription": "Off-core accesses per kilo instruction for mo= dified write requests", + "MetricExpr": "1e3 * OCR.MODIFIED_WRITE.ANY_RESPONSE / tma_info_in= st_mix_instructions", + "MetricGroup": "Offcore", + "MetricName": "tma_info_memory_mix_offcore_mwrite_any_pki" + }, + { + "BriefDescription": "Off-core accesses per kilo instruction for re= ads-to-core requests (speculative; including in-core HW prefetches)", + "MetricExpr": "1e3 * OCR.READS_TO_CORE.ANY_RESPONSE / tma_info_ins= t_mix_instructions", + "MetricGroup": "CacheHits;Offcore", + "MetricName": "tma_info_memory_mix_offcore_read_any_pki" + }, + { + "BriefDescription": "L3 cache misses per kilo instruction for read= s-to-core requests (speculative; including in-core HW prefetches)", + "MetricExpr": "1e3 * OCR.READS_TO_CORE.L3_MISS / tma_info_inst_mix= _instructions", + "MetricGroup": "Offcore", + "MetricName": "tma_info_memory_mix_offcore_read_l3m_pki" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_uc_load_pki" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
+    {
+        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is at least one such miss)",
+        "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES",
+        "MetricGroup": "Mem;MemoryBW;MemoryBound",
+        "MetricName": "tma_info_memory_mlp",
+        "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand loads when there is at least one such miss; per Logical Processor)"
+    },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15"
+    },
+    {
+        "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) covering memory attached to the local and remote sockets",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_dram_bw",
+        "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) covering memory attached to the local and remote sockets. See R2C_Offcore_BW"
+    },
+    {
+        "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R2C)",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_l3m_bw",
+        "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (R2C). This covers traffic going to DRAM or other off-chip memory tiers. See R2C_Offcore_BW"
+    },
+    {
+        "BriefDescription": "Average Off-core access BW for Reads-to-Core (R2C)",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_offcore_bw",
+        "PublicDescription": "Average Off-core access BW for Reads-to-Core (R2C). R2C accounts for demand or prefetch load/RFO/code accesses that fill data into the Core caches"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Fed;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_code_stlb_mpki"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to STLB misses by demand loads",
+        "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * MEM_INST_RETIRED.STLB_MISS_LOADS:R / tma_info_thread_clks",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret",
+        "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_load_stlb_mpki"
+    },
+    {
+        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_page_walks_utilization",
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
+    },
"tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_pipeline_execute" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_mite" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized" + }, + { + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT_SCH0.RD + 
+    {
+        "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT_SCH0.RD + UNC_M_CAS_COUNT_SCH1.RD + UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR) / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
+        "MetricName": "tma_info_system_dram_bw_use",
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Giga Floating Point Operations Per Second",
+        "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\=0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=0x18@ + 8 * cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=0x60@ + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
+        "MetricGroup": "Cor;Flops;HPC",
+        "MetricName": "tma_info_system_gflops",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
+    },
+    {
+        "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]",
+        "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_info_system_time",
+        "MetricGroup": "IoBW;MemOffcore;Server;SoC",
+        "MetricName": "tma_info_system_io_read_bw",
+        "PublicDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU"
+    },
+    {
+        "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]",
+        "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time",
+        "MetricGroup": "IoBW;MemOffcore;Server;SoC",
+        "MetricName": "tma_info_system_io_write_bw",
+        "PublicDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]. Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch (Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
+    },
+    {
+        "BriefDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=0x0@",
+        "MetricGroup": "MemOffcore;MemoryLat;Server;SoC",
+        "MetricName": "tma_info_system_mem_dram_read_latency",
+        "PublicDescription": "Average latency of data read request to external DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches"
+    },
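The kernel_* metrics above use the :k event modifier to restrict counting to kernel (ring 0) execution, so kernel utilization is simply the kernel-only cycle count over all unhalted cycles. Sketch (counter values assumed pre-collected; not part of the patch):

  def kernel_utilization(cycles_kernel_only, cycles_all):
      # cycles_kernel_only: CPU_CLK_UNHALTED.THREAD_P counted with :k.
      # A result > 0.05 trips the MetricThreshold above, i.e. more than
      # 5% of unhalted cycles were spent in the OS kernel.
      return cycles_kernel_only / cycles_all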
+    {
+        "BriefDescription": "Average number of parallel data read requests to external memory",
+        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD\\,thresh\\=0x1@",
+        "MetricGroup": "Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_system_mem_parallel_reads",
+        "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
+    },
+    {
+        "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
+        "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)",
+        "MetricGroup": "SMT",
+        "MetricName": "tma_info_system_smt_2t_utilization"
+    },
+    {
+        "BriefDescription": "Socket actual clocks when any core is active on that socket",
+        "MetricExpr": "cha_0@event\\=0x0@",
+        "MetricGroup": "SoC",
+        "MetricName": "tma_info_system_socket_clks"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization"
+    },
+    {
+        "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
+        "MetricGroup": "SoC",
+        "MetricName": "tma_info_system_uncore_frequency"
+    },
+    {
+        "BriefDescription": "Cross-socket Ultra Path Interconnect (UPI) data transmit bandwidth for data only [MB / sec]",
+        "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 64 / 9 / 1e6",
+        "MetricGroup": "Server;SoC",
+        "MetricName": "tma_info_system_upi_data_transmit_bw"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Pipeline",
+        "MetricName": "tma_info_thread_clks"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggests high rate of \"execute\" at rename stage"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Ret;Summary",
+        "MetricName": "tma_info_thread_ipc"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "TmaL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots"
+    },
+    {
+        "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
+        "MetricGroup": "SMT;TmaL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots_utilization"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%"
+    },
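UopPI and UpTB above are both derived from the retired-uop count reconstructed as tma_retiring * tma_info_thread_slots. A sketch of the two ratios (values illustrative; not part of the patch):

  def uoppi(retired_uops, instructions):
      # ~1.0 for well-optimized code; the > 1.05 threshold above hints
      # at microcode or non-fused operations inflating the uop count.
      return retired_uops / instructions

  def uptb(retired_uops, taken_branches):
      # Uops per taken branch; flagged when below 6 * 1.5 = 9.
      return retired_uops / taken_branches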
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * MEM_LOAD_RETIRED.L2_HIT:R, MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_info_system_core_frequency)) if 0 < MEM_LOAD_RETIRED.L2_HIT:R else MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * MEM_LOAD_RETIRED.L3_HIT:R, MEM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_RETIRED.L3_HIT:R else MEM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%"
+    },
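The l2/l3 hit-latency expressions above pick the measured per-load retire latency (the :R suffix) when it is available, and otherwise fall back to a fixed per-hit cost scaled by core frequency; either way the total cost is spread over all thread cycles. A simplified Python sketch of that selection logic (constants and names are illustrative, and the FB_HIT adjustment factor is omitted):

  def hit_latency_fraction(hits, measured_lat, fixed_lat, thread_clks):
      if measured_lat > 0:
          # cap the measured cost by the frequency-scaled model
          cost = min(hits * measured_lat, hits * fixed_lat)
      else:
          cost = hits * fixed_lat
      return cost / thread_clks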
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_core_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
"MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", + "MetricName": "tma_local_mem", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * MEM_INST_RETIRED.LOCK= _LOADS:R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to memory bandwidth Allocation feature (RDT's memory bandwidth throttling)",
+        "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks",
+        "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma_mem_bandwidth_group",
+        "MetricName": "tma_mba_stalls",
+        "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
+        "ScaleUnit": "100%"
+    },
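For contrast with the cycle-based metrics, tma_microcode_sequencer above is slot-based: UOPS_RETIRED.MS over total issue slots. Sketch (illustrative; not part of the patch):

  def ms_slots_fraction(ms_retired_uops, slots):
      # Share of issue slots consumed by microcoded flows (CPUID,
      # FP assists, repeat-string ops, ...); > 0.05 is flagged above.
      return ms_retired_uops / slots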
Related metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=0x1\\,edge\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions.
Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", + "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", + "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed to this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider.
For example; when there are too many multiply operations", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALLS.SCOREBOARD, 0) / tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues", + "MetricExpr": "(MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM:R + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD:R) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is often caused by non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory", + "MetricExpr": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM:R * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group", + "MetricName": "tma_remote_mem", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is often caused by non-optimal NUMA allocations.
Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * BR_MISP_RETIRED.RET_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessarily mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks + tma_c02_wait", + "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer).
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - loads that cross a 64-byte cache line boundary", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * MEM_INST_RETIRED.SPLIT_LOADS:R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) if 0 < MEM_INST_RETIRED.SPLIT_LOADS:R else MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - loads that cross a 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * MEM_INST_RETIRED.SPLIT_STORES:R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < MEM_INST_RETIRED.SPLIT_STORES:R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors).
Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually have less impact on out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full).
Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M +
DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out the read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out the read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid high X87 usage and preferably upgrade to modern ISA.
See Tip under Tuning Hint", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uncore operating frequency in GHz", + "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #num_packages) / 1e9 / duration_time", + "MetricName": "uncore_frequency", + "ScaleUnit": "1GHz" + }, + { + "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data receive bandwidth (MB/sec)", + "MetricExpr": "UNC_UPI_RxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", + "MetricName": "upi_data_receive_bw", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)", + "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time", + "MetricName": "upi_data_transmit_bw", + "ScaleUnit": "1MB/s" + } +]
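[Aside, not part of the patch: a consumer of these entries evaluates MetricExpr by substituting collected event counts. A minimal Python sketch for the uncore_frequency metric above; all counts are hypothetical placeholders, not measurements. With perf itself, the same metric can typically be requested via "perf stat -M uncore_frequency -a".]

    # Minimal sketch: evaluating the "uncore_frequency" MetricExpr above.
    # All counts are hypothetical placeholders; a real consumer such as
    # perf reads them from the PMU and from the system topology.
    counts = {
        "UNC_CHA_CLOCKTICKS": 8.0e10,  # hypothetical sum over all CHA instances
        "source_count": 40,            # source_count(UNC_CHA_CLOCKTICKS): number of CHAs
        "num_packages": 1,             # #num_packages
        "duration_time": 1.0,          # seconds
    }
    # MetricExpr: UNC_CHA_CLOCKTICKS / (source_count(...) * #num_packages) / 1e9 / duration_time
    ghz = (counts["UNC_CHA_CLOCKTICKS"]
           / (counts["source_count"] * counts["num_packages"])
           / 1e9 / counts["duration_time"])
    print(f"uncore_frequency: {ghz:.2f} GHz")  # ScaleUnit: 1GHz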
diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json index 38b74c6752c2..5da5a10275ba 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json @@ -72,7 +72,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1" @@ -85,7 +84,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -98,7 +96,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -111,7 +108,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", "MSRIndex": "0x3F6", "MSRValue": "0x800", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 2048 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "23", "UMask": "0x1" @@ -124,7 +120,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -137,7 +132,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -150,7 +144,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -163,7 +156,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -176,7 +168,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -189,7 +180,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -200,7 +190,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2" @@ -245,6 +234,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were not supplied by the local socket's L1, L2, or L3 caches and the cacheline is homed locally.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_MISS_LOCAL", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F04C04477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that missed the L3 Cache and were supplied by the local socket (DRAM or PMM), whether or not in Sub NUMA Cluster(SNC) Mode.
In SNC Mode counts PMM or DRAM accesses that are controlled by the close or distant SNC Cluster. It does not count misses to the L3 which go to Local CXL Type 2 Memory or Local Non DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70CC04477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss the L3 cache.", "Counter": "0,1,2,3", @@ -271,5 +280,95 @@ "PublicDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache. Note that this does not capture all elapsed cycles while requests are outstanding - only cycles from when the requests were known by the requesting core to have missed the L3 cache.", "SampleAfterValue": "2000003", "UMask": "0x10" + }, + { + "BriefDescription": "Number of times an RTM execution aborted.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED", + "PublicDescription": "Counts the number of times RTM abort was triggered.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_EVENTS", + "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).", + "SampleAfterValue": "100003", + "UMask": "0x80" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due to various memory events (e.g. read/write capacity and conflicts)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_MEM", + "PublicDescription": "Counts the number of times an RTM execution aborted due to various memory events (e.g. read/write capacity and conflicts).", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due to incompatible memory type", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_MEMTYPE", + "PublicDescription": "Counts the number of times an RTM execution aborted due to incompatible memory type.", + "SampleAfterValue": "100003", + "UMask": "0x40" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due to HLE-unfriendly instructions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_UNFRIENDLY", + "PublicDescription": "Counts the number of times an RTM execution aborted due to HLE-unfriendly instructions.", + "SampleAfterValue": "100003", + "UMask": "0x20" + }, + { + "BriefDescription": "Number of times an RTM execution successfully committed", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.COMMIT", + "PublicDescription": "Counts the number of times RTM commit succeeded.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times an RTM execution started.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.START", + "PublicDescription": "Counts the number of times we entered an RTM region.
Does not count nested transactions.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts due to a data capacity limitation for transactional reads", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_READ", + "PublicDescription": "Speculatively counts the number of Transactional Synchronization Extensions (TSX) aborts due to a data capacity limitation for transactional reads.", + "SampleAfterValue": "100003", + "UMask": "0x80" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts due to a data capacity limitation for transactional writes.", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_WRITE", + "PublicDescription": "Speculatively counts the number of Transactional Synchronization Extensions (TSX) aborts due to a data capacity limitation for transactional writes.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CONFLICT", + "PublicDescription": "Counts the number of times a TSX line had a cache conflict.", + "SampleAfterValue": "100003", + "UMask": "0x1" + } ]
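[Aside, not part of the patch: the RTM_RETIRED events added above combine naturally into an abort rate. A minimal Python sketch; the counts are invented placeholders, e.g. as collected by "perf stat -e RTM_RETIRED.START,RTM_RETIRED.COMMIT,RTM_RETIRED.ABORTED -- ./workload".]

    # Minimal sketch: deriving a transaction abort rate from the
    # RTM_RETIRED events defined above. Counts are hypothetical
    # placeholders, not measurements.
    rtm_start = 100_000   # RTM_RETIRED.START: RTM regions entered (non-nested)
    rtm_commit = 92_000   # RTM_RETIRED.COMMIT: successful commits
    rtm_abort = 8_000     # RTM_RETIRED.ABORTED: aborted executions

    abort_rate = rtm_abort / rtm_start if rtm_start else 0.0
    print(f"RTM abort rate: {abort_rate:.1%} of started transactions")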
diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json b/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json new file mode 100644 index 000000000000..9129fb7b7ce4 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json @@ -0,0 +1,145 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvBC": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvBO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvMT": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvOB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "BvUW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "CodeGen": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "HPC": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+ "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + "TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_utilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound category", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwidth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category", + "tma_int_operations_group": "Metrics contributing to tma_int_operations category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + "tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related
by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_operations category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category", + "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcode_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_utilization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utilized_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_utilized_3m category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound category", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category" +}
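[Aside, not part of the patch: metricgroups.json is a flat name-to-description map that lets tools present metrics by group, e.g. "perf stat -M TopdownL1". A minimal Python sketch of reading it; the path is the one introduced by this patch.]

    # Minimal sketch: metricgroups.json is plain JSON mapping a group name
    # to its description; no schema beyond that.
    import json

    path = "tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json"
    with open(path) as f:
        groups = json.load(f)
    print(groups["TopdownL1"])  # -> "Metrics for top-down breakdown at level 1"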
diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/other.json b/tools/perf/pmu-events/arch/x86/graniterapids/other.json index 8b9b3c920934..7a919999b78b 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/other.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/other.json @@ -1,4 +1,13 @@ [ + { + "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists which are counted by dedicated sub-events. The event also counts for Machine Ordering count.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc1", + "EventName": "ASSISTS.HARDWARE", + "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists which are counted by dedicated sub-events. This includes, but is not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist), assists generated by the ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists. The event also counts for Machine Ordering count.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, { "BriefDescription": "ASSISTS.PAGE_FAULT", "Counter": "0,1,2,3,4,5,6,7", @@ -65,6 +74,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", "Counter": "0,1,2,3", @@ -115,6 +154,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM attached to this socket, whether or not in Sub NUMA Cluster(SNC) Mode.
In SNC Mode counts DRAM accesses that are controlled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were not supplied by the local socket's L1, L2, or L3 caches and were supplied by a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F33004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x733004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that have any type of response.", "Counter": "0,1,2,3", @@ -125,6 +214,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFETCHW's, Hardware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json index 0ef9daf64e2e..da6478607984 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json @@ -32,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -41,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" }, @@ -51,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x10" }, @@ -61,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions retired.",
"SampleAfterValue": "400009", "UMask": "0x1" @@ -71,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -81,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -91,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -101,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -111,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -121,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "400009" }, @@ -130,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x44" }, @@ -139,7 +128,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -149,7 +137,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x51" }, @@ -158,7 +145,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -168,7 +154,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x50" }, @@ -177,7 +162,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -187,7 +171,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x41" }, @@ -196,7 +179,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": 
"BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -206,7 +188,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2" @@ -216,7 +197,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x42" }, @@ -225,7 +205,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_COST", - "PEBS": "1", "SampleAfterValue": "100003", "UMask": "0xc0" }, @@ -234,7 +213,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -244,7 +222,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x60" }, @@ -253,7 +230,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -263,7 +239,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET_COST", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x48" }, @@ -511,7 +486,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -521,7 +495,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -530,7 +503,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, @@ -539,7 +511,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREFETCHIT0/1 instructions", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -548,7 +519,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise-distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -558,7 +528,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" }, @@ -788,6 +757,15 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Cycles stalled due to no store buffers available (not including draining from sync).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa2", + "EventName": "RESOURCE_STALLS.SB", + "PublicDescription": "Counts allocation stall cycles caused by the store buffer (SB) being full. This counts cycles in which the pipeline back-end blocked uop delivery from the front-end.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json index e0a45d4ea848..008ceba86fd9 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json @@ -1056,7 +1056,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Code read from local IA that miss the cache", + "BriefDescription": "Code read from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD", @@ -1741,7 +1741,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss the cache", + "BriefDescription": "Read for ownership from local IA that miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_LOCAL", @@ -1770,7 +1770,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA that miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA that miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_LOCAL", @@ -1780,7 +1780,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA that miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA that miss the LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_REMOTE", @@ -1790,7 +1790,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss the cache", + "BriefDescription": "Read for ownership from local IA that miss the LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_REMOTE", @@ -1882,7 +1882,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss the cache", + "BriefDescription": "Read for ownership from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO", @@ -1892,7 +1892,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA that miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO_PREF", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json index 856ee985ecd4..5c50275c79b0 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json @@ -808,6 +808,26 @@ "PerPkg": "1", "Unit": "IRP" }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REJ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IRP" + }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REQ", + "Experimental": "1", +
"PerPkg": "1", + "UMask": "0x1", + "Unit": "IRP" + }, { "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pu= lled away ownership before a write was committed", "Counter": "0,1,2,3", @@ -818,6 +838,46 @@ "UMask": "0x10", "Unit": "IRP" }, + { + "BriefDescription": "Snoop Hit E/S responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x74", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit I responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", @@ -1550,6 +1610,33 @@ "UMask": "0x4", "Unit": "UPI" }, + { + "BriefDescription": "Cycles in L0p", + "Counter": "0,1,2,3", + "EventCode": "0x27", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Counter": "0,1,2,3", + "EventCode": "0x28", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Counter": "0,1,2,3", + "EventCode": "0x29", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, { "BriefDescription": "Matches on Transmit path of a UPI Port : Non-= Coherent Bypass", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json b/= tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json index cffb9d94b53d..886b99a971be 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": 
"0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -233,7 +233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -245,7 +245,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -257,7 +257,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -269,7 +269,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -281,7 +281,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -293,7 +293,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -305,7 +305,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -317,7 +317,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -329,7 +329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -341,7 +341,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -352,7 +352,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -363,7 +363,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -374,7 +374,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -385,7 +385,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -396,7 +396,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -418,7 +418,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": 
"0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -429,7 +429,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -440,7 +440,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -451,7 +451,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -462,7 +462,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -473,7 +473,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -484,7 +484,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -495,7 +495,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -506,7 +506,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -517,7 +517,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -528,7 +528,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -539,7 +539,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -550,7 +550,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -561,7 +561,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -572,7 +572,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -583,7 +583,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -594,7 +594,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -605,7 +605,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -616,7 +616,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -638,7 +638,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -649,7 +649,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -661,7 +661,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -673,7 +673,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -685,7 +685,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -697,7 +697,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -709,7 +709,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": 
"IIO" }, { @@ -721,7 +721,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -733,7 +733,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -745,7 +745,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -757,7 +757,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -769,7 +769,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -781,7 +781,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -793,7 +793,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -817,7 +817,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -829,7 +829,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -841,7 +841,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -853,7 +853,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1129,7 +1129,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1141,7 +1141,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1153,7 +1153,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1165,7 +1165,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1177,7 +1177,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1201,7 +1201,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1213,7 +1213,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -1225,7 +1225,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -1237,7 +1237,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1249,7 +1249,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1261,7 +1261,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1273,7 +1273,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1285,7 +1285,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { 
@@ -1297,7 +1297,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1318,7 +1318,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1329,7 +1329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1340,7 +1340,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1351,7 +1351,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1373,7 +1373,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1384,7 +1384,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1395,7 +1395,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1406,7 +1406,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1417,7 +1417,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1428,7 +1428,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1439,7 +1439,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1450,7 +1450,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1461,7 +1461,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1472,7 +1472,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1483,7 +1483,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1494,7 +1494,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1505,7 +1505,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1516,7 +1516,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1527,7 +1527,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1538,7 +1538,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1549,7 +1549,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1560,7 +1560,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1571,7 +1571,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1582,7 +1582,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": 
"IIO" }, { @@ -1593,7 +1593,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1604,7 +1604,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1615,7 +1615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1626,7 +1626,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1637,7 +1637,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1648,7 +1648,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1659,7 +1659,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1670,7 +1670,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1681,7 +1681,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1692,7 +1692,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1703,7 +1703,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1715,7 +1715,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1727,7 +1727,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1739,7 +1739,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1751,7 +1751,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1763,7 +1763,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1775,7 +1775,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1787,7 +1787,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1799,7 +1799,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1811,7 +1811,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1823,7 +1823,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1835,7 +1835,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1847,7 +1847,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1859,7 +1859,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1871,7 +1871,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1883,7 +1883,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", 
"Unit": "IIO" }, { @@ -1895,7 +1895,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.jso= n b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json index 08e410b9b0a2..5f4783ff6ce5 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. DCLK is 1/4 of DRAM data rate.", "Counter": "0,1,2,3", "EventCode": "0x01", "EventName": "UNC_M_CLOCKTICKS", @@ -188,6 +188,104 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x10", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x20", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x40", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x80", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e and all pages are closed", + "Counter": "0,1,2,3", + "EventCode": "0x88", + "EventName": "UNC_M_POWER_CHANNEL_PPD_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -358,6 +456,28 @@ "PerPkg": "1", "Unit": "IMC" }, + { + "BriefDescription": "subevent0 - # of cycles all ranks were in SR; subevent1 - # of times all ranks went into SR; subevent2 - # of times ps_sr_active asserted (SRE); subevent3 - # of times ps_sr_active deasserted (SRX); subevent4 - # of times PS->Refresh ps_sr_req asserted (SRE); subevent5 - # of times PS->Refresh ps_sr_req deasserted (SRX); subevent6 - # of cycles PS Ctrlr FSM was in FATAL", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles all ranks were in SR", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json index 02e59f64a544..9ea852ef190e 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json @@ -7,5 +7,103 @@ "PerPkg": "1", "PublicDescription": "PCU Clockticks: The PCU runs off a fixed 1 GHz clock. This event counts the number of pclk cycles measured while the counter was enabled. The pclk, like the Memory Controller's dclk, counts at a constant rate making it a good measure of actual wall time.", "Unit": "PCU" + }, + { + "BriefDescription": "Thermal Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x04", + "EventName": "UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Thermal Strongest Upper Limit Cycles : Number of cycles any frequency is reduced due to a thermal limit. Count only if throttling is occurring.", + "Unit": "PCU" + }, + { + "BriefDescription": "Power Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x05", + "EventName": "UNC_P_FREQ_MAX_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Power Strongest Upper Limit Cycles : Counts the number of cycles when power is the upper limit on frequency.", + "Unit": "PCU" + }, + { + "BriefDescription": "Cycles spent changing Frequency", + "Counter": "0,1,2,3", + "EventCode": "0x74", + "EventName": "UNC_P_FREQ_TRANS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Cycles spent changing Frequency : Counts the number of cycles when the system is changing frequency. This cannot be filtered by thread ID. One can also use it with the occupancy counter that monitors number of threads in C0 to estimate the performance impact that frequency transitions had on the system.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C2E", + "Counter": "0,1,2,3", + "EventCode": "0x2b", + "EventName": "UNC_P_PKG_RESIDENCY_C2E_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C2E : Counts the number of cycles when the package was in C2E.
This event can be used in c= onjunction with edge detect to count C2E entrances (or exits using invert).= Residency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C6", + "Counter": "0,1,2,3", + "EventCode": "0x2d", + "EventName": "UNC_P_PKG_RESIDENCY_C6_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C6 : Counts the = number of cycles when the package was in C6. This event can be used in con= junction with edge detect to count C6 entrances (or exits using invert). R= esidency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C0", + "Counter": "0,1,2,3", + "EventCode": "0x35", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C0", + "PerPkg": "1", + "PublicDescription": "Number of cores in C0 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C3", + "Counter": "0,1,2,3", + "EventCode": "0x36", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Number of cores in C3 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C6", + "Counter": "0,1,2,3", + "EventCode": "0x37", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C6", + "PerPkg": "1", + "PublicDescription": "Number of cores in C6 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "External Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x0a", + "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "External Prochot : Counts the number of cycl= es that we are in external PROCHOT mode. This mode is triggered when a sen= sor off the die determines that something off-die (like DRAM) is too hot an= d must throttle to avoid damaging the chip.", + "Unit": "PCU" + }, + { + "BriefDescription": "Internal Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x09", + "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Internal Prochot : Counts the number of cycl= es that we are in Internal PROCHOT mode. 
This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", "Unit": "PCU" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 1a28f9d70e74..43919beef126 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -12,7 +12,7 @@ GenuineIntel-6-CF,v1.11,emeraldrapids,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-B6,v1.05,grandridge,core -GenuineIntel-6-A[DE],v1.02,graniterapids,core +GenuineIntel-6-A[DE],v1.04,graniterapids,core GenuineIntel-6-(3C|45|46),v35,haswell,core GenuineIntel-6-3F,v28,haswellx,core GenuineIntel-6-7[DE],v1.22,icelake,core -- 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024 Date: Mon, 9 Dec 2024 14:27:48 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-12-irogers@google.com> Mime-Version: 1.0 References: <20241209222800.296000-1-irogers@google.com> Subject: [PATCH v1 11/22] perf vendor events: Update Haswell events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v35 to v36. Update TMA metrics from 4.8 to 5.01.
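As a quick sanity check after applying the series, the version bump can be read back from the mapfile. A minimal Python sketch (the mapfile path and the v35-to-v36 bump come from this patch; the column-layout check and the script itself are illustrative assumptions, not part of the change):

  #!/usr/bin/env python3
  # Sketch: report the event list version mapfile.csv records for Haswell.
  # Assumes it runs from the root of a kernel source tree.
  import csv

  with open("tools/perf/pmu-events/arch/x86/mapfile.csv") as f:
      for row in csv.reader(f):
          # Rows are: Family-model,Version,Filename,EventType
          if len(row) == 4 and row[2] == "haswell":
              print(row[0], row[1])  # expect v36 once this patch is applied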
Bring in the event updates v36: https://github.com/intel/perfmon/commit/616ec6fc0315dac35c1bea0abc7f59e21a2= d51c0 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/haswell/hsw-metrics.json | 260 ++++++++++-------- .../pmu-events/arch/x86/haswell/memory.json | 2 +- .../arch/x86/haswell/metricgroups.json | 5 + tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 4 files changed, 151 insertions(+), 118 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json b/tool= s/perf/pmu-events/arch/x86/haswell/hsw-metrics.json index b693c0b0cafe..0c1040b7e38c 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -92,8 +92,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. 
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -104,7 +104,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -114,7 +114,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. 
This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -125,7 +125,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { @@ -133,8 +133,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -143,18 +143,18 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. 
Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= ", "ScaleUnit": "100%" }, { @@ -165,7 +165,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -174,8 +174,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -183,7 +183,7 @@ "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, @@ -193,8 +193,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_PENDING / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -203,7 +203,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -211,7 +211,7 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, @@ -220,8 +220,8 @@ "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.W= ALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { @@ -229,27 +229,27 @@ "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES= .WALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_= CORE / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE.= Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_cle= ars", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -279,33 +279,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. 
This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -316,7 +316,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -328,7 +328,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@) if= #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ /= 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@))", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@= ) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D= 0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@))", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -347,7 +347,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -392,7 +398,7 @@ 
"MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -415,7 +421,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -427,7 +433,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -445,7 +451,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -464,7 +470,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -496,14 +502,14 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -521,7 +527,7 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" @@ -531,13 +537,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -546,6 +553,19 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -558,6 +578,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -565,7 +592,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -574,7 +601,8 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -600,24 +628,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURAT= ION) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": 
"tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports= _utilized_1", "ScaleUnit": "100%" }, { @@ -625,8 +653,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY= .STALLS_L2_PENDING) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -635,8 +663,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_P= ENDING / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -645,8 +673,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -654,18 +682,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -682,10 +710,10 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. 
Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -696,15 +724,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, = tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x6@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -713,19 +741,19 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALL= S_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_= ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UO= PS_EXECUTED.CORE\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_U= NHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\= \,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_= ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CY= CLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend= _bound", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS= _LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_A= CTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cp= u@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_C= LK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.C= ORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info= _thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENT= S.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * t= ma_backend_bound", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -734,7 +762,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_ma= chine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -743,7 +771,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. 
For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -751,8 +779,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -761,7 +789,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -770,7 +798,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", "ScaleUnit": "100%" }, { @@ -806,7 +834,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -815,7 +843,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. 
Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -830,46 +858,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -888,8 +916,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -897,16 +925,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -915,8 +943,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -924,18 +952,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/haswell/memory.json b/tools/per= f/pmu-events/arch/x86/haswell/memory.json index edb1b5b9f553..a76f1862e5ea 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/memory.json +++ b/tools/perf/pmu-events/arch/x86/haswell/memory.json @@ -412,7 +412,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "2", + "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x4" }, diff --git a/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json b/too= ls/perf/pmu-events/arch/x86/haswell/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= 
gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 43919beef126..903d6ada7060 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -13,7 +13,7 @@ GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-B6,v1.05,grandridge,core GenuineIntel-6-A[DE],v1.04,graniterapids,core -GenuineIntel-6-(3C|45|46),v35,haswell,core +GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v28,haswellx,core GenuineIntel-6-7[DE],v1.22,icelake,core GenuineIntel-6-6[AC],v1.26,icelakex,core --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7838D19E830 for ; Mon, 9 Dec 2024 22:28:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783343; cv=none; b=TUggMNpCGIqPK5bJykRQIq1fQMEiUnBA7kWZ1c9vlvpAE7AWFSVFbxhh10HOuvlUft47Uv3cSgZAB1b3u7RTQmh2Skc694BF9tf7w89K+8+4LBZ+8OQfH4aDOQUKxKlo4APglpY+yURumnCYWmyJDYdeaJ25zqzss4Sa5YfFCPE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783343; c=relaxed/simple; bh=sSPgpyTCAiLPucTa4obuMMC4+qNlmoKnD0o/ckzsWlU=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=U7MT6TjcrlANUPnyVi5prleFGl4embSDyS8iFhd0eu6KqVM26psY/8tp8HNWTBnFBG9plGA35OJ8Wt2rI0QZZjXBqFntaQTJSBt70ZpUlgNDlU2qS79/7c+WEtVH5zW3TcetLRBJ5kpZVZWJUEjni7fUOjwH8tO5Ox4vtaF4UOw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=HXSDxTSA; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="HXSDxTSA" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6ef4e426a3fso8622297b3.0 for ; Mon, 09 Dec 2024 14:28:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1733783317; x=1734388117; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=Py+JhVWHoRj2WWldJqCyq0p8P5Al3/30p4SRpdI70Vo=; b=HXSDxTSAYi+wEh2MoNNncdFMPYv4oR/Upn3zSzsyk2VxKeBQ28ylvdUuubyL0GFTPR 0hXymKyPnxgQIk3rGgTVlDwdCn25tHLKcTBo46AhaKRfU1cQS/g15lK7tS3hc2VKmqWD Ix2F4jNTxbfZdzjnd4RxK11UrCXLqp/IfqgHAZOBLNiR0HBkG8jyCHIxCQItIVSMYcpk wWhRevmT9ShimAeFgyoY8JyFKCWRk7/wf9xCALhgtjWPxMnxBiMD3iKfqWH54eQ+2nQd n43jik1CGHhS3APn8kZ+RDVCqMvuzqXeUYIHjPY8aRFZamyNMjqCujfG4N/Y3uWqyB6d ZMtA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; 
From nobody Fri Dec 27 09:48:59 2024 Date: Mon, 9 Dec 2024 14:27:49 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-13-irogers@google.com> Mime-Version: 1.0 References: <20241209222800.296000-1-irogers@google.com> Subject: [PATCH v1 12/22] perf vendor events: Update HaswellX events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v28 to v29. Update TMA metrics from 4.8 to 5.01. 
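Most of the metric hunks below are mechanical: MetricThreshold expressions lose redundant parentheses (the "&" chains evaluate the same either way), PublicDescription strings lose trailing periods, and several metrics join the new LockCont and BvMB groups. Because the metrics files are plain JSON arrays, a grouping change is easy to inspect; here is a minimal sketch, assuming a kernel checkout (the helper name metrics_in_group is illustrative only):

  import json

  def metrics_in_group(path, group):
      # MetricGroup is a ";"-separated list of group tags.
      with open(path) as f:
          entries = json.load(f)
      return [e["MetricName"] for e in entries
              if group in e.get("MetricGroup", "").split(";")]

  # With this patch applied, for example:
  #   metrics_in_group(
  #       "tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json",
  #       "LockCont")
  # should include tma_contested_accesses and tma_false_sharing.

At runtime the same grouping is what "perf stat -M LockCont" would expand to.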
Bring in the event updates v29: https://github.com/intel/perfmon/commit/71dbf03aba964f79fb096c9ded385c8a486= a99b3 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/haswellx/hsx-metrics.json | 296 ++++++++++-------- .../arch/x86/haswellx/metricgroups.json | 5 + .../arch/x86/haswellx/uncore-cache.json | 28 +- .../x86/haswellx/uncore-interconnect.json | 38 +-- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 5 files changed, 191 insertions(+), 178 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json b/too= ls/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json index 8f2ba3391e35..1a05b74be575 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,24 +76,24 @@ "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x19= e@ * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x1c= 8\\,filter_tid\\=3D0x3e@ * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -102,14 +102,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -209,13 +209,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_o= pc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=3D= 0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x182@)= ", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_= opc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\= =3D0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x18= 2@)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -276,12 +276,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -294,8 +294,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -306,7 +306,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -316,7 +316,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -327,7 +327,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { @@ -335,8 +335,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -345,18 +345,18 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_D= RAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RET= IRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + ME= M_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS= _RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_= HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_U= OPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM = + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED= .REMOTE_FWD)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= , tma_remote_cache", "ScaleUnit": "100%" }, { @@ -367,7 +367,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -376,8 +376,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRA= M + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIR= ED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -385,7 +385,7 @@ "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample with: ARITH.DIVIDER_UOPS",
         "ScaleUnit": "100%"
     },
@@ -395,8 +395,8 @@
         "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -405,7 +405,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -413,7 +413,7 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
@@ -422,8 +422,8 @@
         "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
@@ -431,27 +431,27 @@
         "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -481,33 +481,33 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "tma_microcode_sequencer",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
        "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -518,7 +518,7 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))",
+        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_core_core_clks"
     },
@@ -530,7 +530,7 @@
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))",
+        "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@))",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -549,7 +549,13 @@
         "MetricName": "tma_info_frontend_ipunknown_branch"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
@@ -594,7 +600,7 @@
"MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -617,7 +623,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -629,7 +635,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -647,7 +653,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -666,7 +672,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -698,14 +704,14 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -723,7 +729,7 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" @@ -733,13 +739,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -750,18 +757,31 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", + "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -774,6 +794,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -782,12 +809,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -796,7 +823,8 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -822,24 +850,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURAT= ION) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports= _utilized_1", "ScaleUnit": "100%" }, { @@ -847,8 +875,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY= .STALLS_L2_PENDING) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -857,8 +885,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_P= ENDING / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -867,8 +895,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -876,18 +904,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -905,18 +933,18 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -927,15 +955,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. 
Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, = tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x6@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -944,19 +972,19 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALL= S_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_= ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UO= PS_EXECUTED.CORE\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_U= NHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\= \,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_= ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CY= CLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend= _bound", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS= _LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_A= CTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cp= u@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_C= LK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.C= ORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info= _thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENT= S.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * t= ma_backend_bound", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -965,7 +993,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. 
The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_ma= chine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -974,7 +1002,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -982,8 +1010,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -992,7 +1020,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1001,7 +1029,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", "ScaleUnit": "100%" }, { @@ -1037,7 +1065,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1046,7 +1074,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1061,46 +1089,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). 
(2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1109,8 +1137,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1118,8 +1146,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1138,8 +1166,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1147,16 +1175,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1165,8 +1193,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1174,18 +1202,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/haswellx/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json b/to= ols/perf/pmu-events/arch/x86/haswellx/uncore-cache.json index 3c23bafcba28..d664af16c1db 100644 --- 
a/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json @@ -812,7 +812,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1869,7 +1869,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1879,7 +1879,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2874,7 +2874,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2884,7 +2884,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2894,7 +2894,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2904,7 +2904,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3438,7 +3438,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. 
This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3448,7 +3448,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3458,7 +3458,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3878,7 +3878,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3888,7 +3888,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. 
This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3898,7 +3898,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3908,7 +3908,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. 
This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.js= on b/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json index 121de411d312..8f73a8649b39 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json @@ -1,24 +1,4 @@ [ - { - "BriefDescription": "Number of non data (control) flits transmitte= d . Derived from unc_q_txl_flits_g0.non_data", - "Counter": "0,1,2,3", - "EventName": "QPI_CTL_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL= non-data flits transmitted across QPI. This basically tracks the protocol= overhead on the QPI link. One can get a good picture of the QPI-link char= acteristics by evaluating the protocol flits, data flits, and idle/null fli= ts. This includes the header flits for data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x4", - "Unit": "QPI" - }, - { - "BriefDescription": "Number of data flits transmitted . Derived fr= om unc_q_txl_flits_g0.data", - "Counter": "0,1,2,3", - "EventName": "QPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. 
It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data fli= ts transmitted over QPI. Each flit contains 64b of data. This includes bo= th DRS and NCB data flits (coherent and non-coherent). This can be used to= calculate the data bandwidth of the QPI link. One can get a good picture = of the QPI-link characteristics by evaluating the protocol flits, data flit= s, and idle/null flits. This does not include the header flits that go in = data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x2", - "Unit": "QPI" - }, { "BriefDescription": "Total Write Cache Occupancy; Any Source", "Counter": "0,1", @@ -53,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -63,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -73,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -83,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -93,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -103,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -113,7 
+93,7 @@
         "EventCode": "0x13",
         "EventName": "UNC_I_COHERENT_OPS.RFO",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of coherency related operations servied by the IRP",
+        "PublicDescription": "Counts the number of coherency related operations serviced by the IRP",
         "UMask": "0x8",
         "Unit": "IRP"
     },
@@ -123,7 +103,7 @@
         "EventCode": "0x13",
         "EventName": "UNC_I_COHERENT_OPS.WBMTOI",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of coherency related operations servied by the IRP",
+        "PublicDescription": "Counts the number of coherency related operations serviced by the IRP",
         "UMask": "0x40",
         "Unit": "IRP"
     },
@@ -493,7 +473,7 @@
         "EventCode": "0x16",
         "EventName": "UNC_I_TRANSACTIONS.WRITES",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID.; Trackes only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.",
+        "PublicDescription": "Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID.; Tracks only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.",
         "UMask": "0x2",
         "Unit": "IRP"
     },
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 903d6ada7060..71cf64060499 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -14,7 +14,7 @@ GenuineIntel-6-7A,v1.01,goldmontplus,core
 GenuineIntel-6-B6,v1.05,grandridge,core
 GenuineIntel-6-A[DE],v1.04,graniterapids,core
 GenuineIntel-6-(3C|45|46),v36,haswell,core
-GenuineIntel-6-3F,v28,haswellx,core
+GenuineIntel-6-3F,v29,haswellx,core
 GenuineIntel-6-7[DE],v1.22,icelake,core
 GenuineIntel-6-6[AC],v1.26,icelakex,core
 GenuineIntel-6-3A,v24,ivybridge,core
--
2.47.1.545.g3c1d2e2a6a-goog
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:50 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-14-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 13/22] perf vendor events: Update Icelake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.22 to v1.24. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.24:
https://github.com/intel/perfmon/commit/d4f10746cf549466723d17cd214e1ee9cb7bac11

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../pmu-events/arch/x86/icelake/cache.json    |  34 +-
 .../pmu-events/arch/x86/icelake/frontend.json |  17 -
 .../arch/x86/icelake/icl-metrics.json         | 849 +++++++++---------
 .../pmu-events/arch/x86/icelake/memory.json   |  13 +-
 .../arch/x86/icelake/metricgroups.json        |  10 +-
 .../pmu-events/arch/x86/icelake/pipeline.json |  30 +-
 .../arch/x86/icelake/uncore-interconnect.json |  76 --
 .../arch/x86/icelake/uncore-other.json        |   2 +-
 .../arch/x86/icelake/virtual-memory.json      |  18 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 10 files changed, 500 insertions(+), 551 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/icelake/cache.json b/tools/perf/pmu-events/arch/x86/icelake/cache.json
index 3508340acd0e..015f70f157d1 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/cache.json
@@ -75,11 +75,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0xF2",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -251,7 +251,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions.
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -262,7 +261,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -273,7 +271,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -284,7 +281,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -295,7 +291,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -306,7 +301,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -317,7 +311,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -328,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -339,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -350,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -361,7 +351,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -372,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -383,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -394,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", 
"PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -405,7 +391,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -416,7 +401,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -427,7 +411,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -438,7 +421,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -449,7 +431,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -460,7 +441,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -910,6 +890,16 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, + { + "BriefDescription": "Cycles with outstanding code read requests pe= nding.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x60", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Cycles with outstanding code read requests p= ending. Code Read requests include both cacheable and non-cacheable Code R= eads. Requests are considered outstanding from the time they miss the core= 's L2 cache until the transaction completion message is sent to the request= or.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles where at least 1 outstanding Demand RF= O request is pending.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/icelake/frontend.json b/tools/p= erf/pmu-events/arch/x86/icelake/frontend.json index e7c7d4d4152d..7afa2436d584 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/icelake/frontend.json @@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. 
Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tool= s/perf/pmu-events/arch/x86/icelake/icl-metrics.json index 9085ea60f516..63e28a03dc60 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,11 +129,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -147,7 +240,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. 
Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -155,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -164,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
         "ScaleUnit": "100%"
     },
     {
@@ -173,18 +266,66 @@
         "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(29 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((32.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (27 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
@@ -194,25 +335,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -221,7 +362,7 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
@@ -231,8 +372,8 @@
         "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -241,7 +382,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -249,44 +390,44 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
     {
@@ -296,7 +437,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -306,16 +447,16 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -324,7 +465,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -333,7 +474,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -341,7 +490,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -350,7 +499,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -359,8 +508,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -368,8 +517,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -377,7 +526,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -389,17 +538,17 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -407,41 +556,41 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -455,32 +604,11 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" }, - { - "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", - "MetricExpr": "tma_info_botlnk_l0_core_bound_likely", - "MetricGroup": "Cor;Metric;SMT", - "MetricName": "tma_info_botlnk_core_bound_likely", - "MetricThreshold": "tma_info_botlnk_core_bound_likely > 0.5" - }, - { - "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd))", - "MetricGroup": 
"DSBmiss;Fed;Scaled_Slots;tma_issueFB", - "MetricName": "tma_info_botlnk_dsb_misses", - "MetricThreshold": "tma_info_botlnk_dsb_misses > 10" - }, - { - "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", - "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", - "MetricName": "tma_info_botlnk_ic_misses", - "MetricThreshold": "tma_info_botlnk_ic_misses > 5" - }, { "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", "MetricConstraint": "NO_GROUP_EVENTS", @@ -491,8 +619,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -500,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -509,108 +637,10 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_= full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_= fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -671,11 +701,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -688,20 +718,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -737,7 +767,13 @@ "MetricName": "tma_info_frontend_lsd_coverage" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -755,7 +791,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -763,7 +799,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -771,7 +807,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -779,7 +815,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -787,7 +823,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -795,7 +831,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -840,7 +876,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -850,21 +886,9 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, - { - "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "tma_info_memory_mix_bus_lock_pki", - "MetricGroup": "Mem;Metric", - "MetricName": "tma_info_memory_bus_lock_pki" - }, - { - "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", - "MetricExpr": "tma_info_memory_tlb_code_stlb_mpki", - "MetricGroup": "Fed;MemoryTLB;Metric", - "MetricName": "tma_info_memory_code_stlb_mpki" - }, { "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", @@ -889,12 +913,6 @@ "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t" }, - { - "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "tma_info_memory_latency_data_l2_mlp", - "MetricGroup": "Memory_BW;Metric;Offcore", - "MetricName": "tma_info_memory_data_l2_mlp" - }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", @@ -903,16 +921,10 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", - "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l1d_cache_fill_bw_2t" - }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", @@ -927,16 +939,10 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", - "MetricExpr": 
"tma_info_memory_l2_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l2_cache_fill_bw_2t" - }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", @@ -969,28 +975,16 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, - { - "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_access_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", - "MetricName": "tma_info_memory_l3_cache_access_bw_2t" - }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l3_cache_fill_bw_2t" - }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", @@ -1005,52 +999,22 @@ }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "tma_info_memory_load_l2_miss_latency", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": "tma_info_memory_latency_load_l3_miss_latency" - }, - { - "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l2_miss_latency" - }, - { - "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", - "MetricGroup": "Memory_BW;Metric;Offcore", - "MetricName": "tma_info_memory_load_l2_mlp" - }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": 
"cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x0@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency" }, - { - "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "tma_info_memory_tlb_load_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_load_stlb_mpki" - }, { "BriefDescription": "\"Bus lock\" per kilo instruction", "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", @@ -1059,7 +1023,7 @@ }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "tma_info_memory_uc_load_pki", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki" }, @@ -1071,17 +1035,11 @@ "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "tma_info_memory_tlb_page_walks_utilization", - "MetricGroup": "Core_Metric;Mem;MemoryTLB", - "MetricName": "tma_info_memory_page_walks_utilization", - "MetricThreshold": "tma_info_memory_page_walks_utilization > 0.5" - }, - { - "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "tma_info_memory_tlb_store_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_store_stlb_mpki" + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1108,15 +1066,9 @@ "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, - { - "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", - "MetricGroup": "Mem;Metric", - "MetricName": "tma_info_memory_uc_load_pki" - }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1143,18 +1095,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": 
"MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1172,14 +1124,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1189,13 +1141,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1204,12 +1157,25 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1217,7 +1183,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1225,7 +1191,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1239,6 +1205,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1246,7 +1219,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1255,14 +1228,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1272,13 +1246,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1294,33 +1268,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - 
"MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1329,8 +1311,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1339,17 +1330,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETI= RED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) = / tma_info_thread_clks", + "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1357,18 +1348,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1385,7 +1376,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1393,16 +1384,40 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & 
tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1412,7 +1427,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%" }, { @@ -1422,16 +1437,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1439,8 +1454,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1450,11 +1465,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1468,7 +1483,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). 
These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1476,8 +1491,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1491,19 +1506,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. 
Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1511,8 +1534,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1520,7 +1543,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1535,19 +1558,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1583,7 +1606,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1591,8 +1614,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1600,8 +1623,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1609,7 +1632,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1618,7 +1641,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1627,14 +1650,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1647,7 +1670,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1656,7 +1679,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1665,8 +1688,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1675,17 +1698,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1693,8 +1716,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1703,17 +1726,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1730,7 +1753,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1738,7 +1761,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1746,7 +1793,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1755,7 +1802,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY",
         "ScaleUnit": "100%"
     },
@@ -1764,8 +1811,8 @@
         "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%"
     },
     {
diff --git a/tools/perf/pmu-events/arch/x86/icelake/memory.json b/tools/perf/pmu-events/arch/x86/icelake/memory.json
index f73035f44330..abaf3f4f9d63 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/memory.json
@@ -88,7 +88,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1"
@@ -101,7 +100,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -114,7 +112,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1"
@@ -127,7 +124,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -140,7 +136,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
@@ -153,7 +148,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1"
@@ -166,7 +160,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1"
@@ -179,7 +172,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -287,17 +279,16 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of times RTM abort was triggered.",
         "SampleAfterValue": "100003",
         "UMask": "0x4"
     },
     {
-        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)",
+        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED_EVENTS",
-        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).",
+        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
     },
diff --git a/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json b/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json
index 3a88260194d1..80ca8021f2de 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json
@@ -37,6 +37,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -83,7 +84,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -93,6 +96,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_issue2P": "Metrics related by the issue $issue2P",
     "tma_issueBM": "Metrics related by the issue $issueBM",
     "tma_issueBW": "Metrics related by the issue $issueBW",
@@ -112,10 +116,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
     "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
@@ -128,5 +135,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
diff --git a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
index 4fdf07c7beb7..44659f26cbab 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json
@@ -9,6 +9,15 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x9"
     },
+    {
+        "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0x14",
+        "EventName": "ARITH.FP_DIVIDER_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -23,7 +32,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009"
     },
@@ -32,7 +40,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11"
@@ -42,7 +49,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10"
@@ -52,7 +58,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
"PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "50021", "UMask": "0x11" @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "50021", "UMask": "0x10" @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch in= structions retired (excluding RETs. 
TSX aborts is considered indirect branc= h).", "SampleAfterValue": "50021", "UMask": "0x80" @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "50021", "UMask": "0x8" @@ -377,7 +369,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -387,7 +378,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -396,7 +386,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -404,7 +393,6 @@ "BriefDescription": "Precise instruction retired event with a redu= ced effect of PEBS shadow in IP distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = more unbiased distribution of samples across instructions retired. It utili= zes the Precise Distribution of Instructions Retired (PDIR) feature to miti= gate some bias in how retired instructions get sampled. Use on Fixed Counte= r 0.", "SampleAfterValue": "2000003", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.jso= n b/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json index 909a73d7f2d3..bee503eda18c 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json @@ -8,71 +8,6 @@ "UMask": "0x1", "Unit": "ARB" }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event = UNC_ARB_IFA_OCCUPANCY.ALL", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x85", - "EventName": "UNC_ARB_DAT_OCCUPANCY.ALL", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x1", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated.", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x85", - "EventName": "UNC_ARB_DAT_OCCUPANCY.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_TRK_OCCUPANCY.RD", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_REQ_TRK_OCCUPANCY.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "Number of all coherent Data Read entries. Doe= sn't include prefetches", - "Counter": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated.", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_TRK_OCCUPANCY.ALL", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x1", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_REQ_TRK_OCCUPANCY.DRD", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_TRK_OCCUPANCY.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, { "BriefDescription": "Total number of all outgoing entries allocate= d. Accounts for Coherent and non-coherent traffic.", "Counter": "1", @@ -81,16 +16,5 @@ "PerPkg": "1", "UMask": "0x1", "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_REQ_TRK_REQUEST.DRD", - "Counter": "0,1", - "Deprecated": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_TRK_REQUESTS.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" } ] diff --git a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json b/too= ls/perf/pmu-events/arch/x86/icelake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json b/t= ools/perf/pmu-events/arch/x86/icelake/virtual-memory.json index 3ff51040f84f..9df790d4361f 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", "Counter": "0,1,2,3", @@ -82,6 +91,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 71cf64060499..5a20439450ab 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -15,7 +15,7 @@ GenuineIntel-6-B6,v1.05,grandridge,core GenuineIntel-6-A[DE],v1.04,graniterapids,core GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v29,haswellx,core -GenuineIntel-6-7[DE],v1.22,icelake,core +GenuineIntel-6-7[DE],v1.24,icelake,core GenuineIntel-6-6[AC],v1.26,icelakex,core GenuineIntel-6-3A,v24,ivybridge,core GenuineIntel-6-3E,v24,ivytown,core --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C443D19E99E for ; Mon, 9 Dec 2024 22:28:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783359; cv=none; b=ME97/ui5nTuYjbruOxkWNyLBtAPRq+xIOyTSkY0467C6ygFeIfK5DTQSgTW8n2THtwpWBv1W1NdL/MYi3C70KV1XOnxfqNPZzM7eh9OIUAtDMCUho74hmcsRXLA0RAoTotcqVFivozKb6ojYgMiQvrGUjM1LKA2v7FdBFuTcC9Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783359; c=relaxed/simple; bh=4JYnVh8CbLD7fl3w8V9WirRqL+rSvpskmlxDXlXC/EI=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=qtBKZd9MRDJvp0A8nhO5JVdEDH1UKypDmgmtLymVgECoMjeSuaZUHhLkteeKgmvwEd7upD7hY1MHqNXjb5pUrwHddrEKkrRBCxCdnp8gfoHqwMkz4QWywsokTj9n4OWfmULje+3pvWSJDmL8RISYTf7B74cGc6EPeH/PiSXVfYY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=fzXxFsTZ; arc=none smtp.client-ip=209.85.219.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="fzXxFsTZ" Received: by mail-yb1-f201.google.com with SMTP id 3f1490d57ef6-e3a1f928b18so1371662276.2 for ; Mon, 09 Dec 2024 14:28:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1733783322; x=1734388122; darn=vger.kernel.org; 
h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=e5ndn5M7EvReNq3EH//NyZbg7bX6Ay49WvetmEKoUBE=; b=fzXxFsTZefUYKrwCQROuwkYcI0HDdM2VZ72PIHweNmuamM8QP9MzAd5/v7ZRRk6k/V IY2WZsTGim1PXy6B5R2ybYOUR+eHSP7DzfU+OBaCJwKgghZi5F9J0TfayblIKG1htj1X LeaT9tvkGJPHp383Hr/lhRPCe4jWj5T5xGCGtEqHtzmxV6SSe1ojhhb3vQDWmjguZb+v 7O4RFw5rcWkwp8hXi+OD/+CYN9MGUXbB6ltO1it8te3E/Ap6Gu9W4ZVWJMSfHGnRH101 Nt0LWUoNWd0tBkRHNwTOBqjC09DS0f0QtBkSSLPdjD/bg7NwPsmSpEpFN+SY/qo4OkCv f1rw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1733783322; x=1734388122; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=e5ndn5M7EvReNq3EH//NyZbg7bX6Ay49WvetmEKoUBE=; b=JPNACGjUlHipT/D+xXHs+dW8Whej2JnLVfGWllfsEFR2Kjja4RFttdf9sTEupOfO7b RjXV//nmLjMkeuhw6M9Z0Q2su360SoWVUP21/d4mAVzo3CMx2lCBgUnhbgrgPblfX3Gw uwTRzf6Rm7fLNFC7aLf8fPR4nisTl95AEIhhxBkWUeoF76gTh/AWyhUEe/Okflgcqtb6 wFxl0pAbvEN4R+bWeC+AW9s7PDmkIS9zx4flQuwc4oRo4Hpb1KIwkygET96bcEQbkVN9 i2yfjF6RrAqZ+xhRXv9hHQEyMznk6Vl8Ud0on/XIPFMaKOw5pscWSM2p91tUZOpmtAqJ jY9g== X-Forwarded-Encrypted: i=1; AJvYcCVJEu5vsKVgTjUAByEi7Wx2GLEb1SBmYl1tw1ohvbCm1l+GZ+6amb3R+vEihfF/sJ6zQEHaJDzfPEUfv44=@vger.kernel.org X-Gm-Message-State: AOJu0Yx63FFhQADfNnoO75gLy5TKUucPIJBnCfnGAK1K0R6Wt3X+8VQk 0md9gQPny2n40tpTq0u86PeFljaKmSxrg0of3lECt8axwzkvNY/nwjq0wv57w2jINRazhV/DqHz 69JCHEg== X-Google-Smtp-Source: AGHT+IECDovSrLGZ1bDxuKNZrIvK1Y8piYxcMOqCcqjIPRTbNQctj2LvgTU0U/kUqXtBMpBQ857MphAyPtPX X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c8f1:280e:e8ad:32]) (user=irogers job=sendgmr) by 2002:a25:dccb:0:b0:e39:8986:98b0 with SMTP id 3f1490d57ef6-e3a0b522f92mr7056276.9.1733783321837; Mon, 09 Dec 2024 14:28:41 -0800 (PST) Date: Mon, 9 Dec 2024 14:27:51 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-15-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241209222800.296000-1-irogers@google.com> X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog Subject: [PATCH v1 14/22] perf vendor events: Update IcelakeX events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.26 to v1.27. Update TMA metrics from 4.8 to 5.01. 
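A note for reviewers, not part of the change itself: below is the kind of consistency check that can be run over a perfmon drop like this one before merging. It is a minimal Python sketch that assumes only what the hunks in this series show (each event/metric file is a flat JSON array of objects, while metricgroups.json is a plain object and is skipped), and the leftover-"PEBS" scan reflects this series removing those explicit markers. The script name and invocation are illustrative.

#!/usr/bin/env python3
# check_pmu_json.py - rough sanity check for a pmu-events JSON update.
import json
import sys

def check(path):
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        return  # e.g. metricgroups.json maps group names to descriptions
    for entry in data:
        name = entry.get("EventName") or entry.get("MetricName") or "<unnamed>"
        # This series drops the explicit "PEBS" fields, so none should survive.
        if "PEBS" in entry:
            print(f"{path}: {name}: stale PEBS field ({entry['PEBS']})")
        # Every metric must still carry an expression perf can evaluate.
        if "MetricName" in entry and "MetricExpr" not in entry:
            print(f"{path}: {name}: MetricName without MetricExpr")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        check(path)

Invoked as, say, "python3 check_pmu_json.py tools/perf/pmu-events/arch/x86/icelakex/*.json", a silent run means no event kept a stale PEBS marker and no metric lost its expression in the update.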
Bring in the event updates v1.27:
https://github.com/intel/perfmon/commit/6ee80d0532a778caee68d6e29d8e05278567e69f

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../pmu-events/arch/x86/icelakex/cache.json   |  41 +-
 .../arch/x86/icelakex/frontend.json           |  17 -
 .../arch/x86/icelakex/icx-metrics.json        | 852 ++++++++++--------
 .../pmu-events/arch/x86/icelakex/memory.json  |  13 +-
 .../arch/x86/icelakex/metricgroups.json       |  10 +-
 .../arch/x86/icelakex/pipeline.json           |  30 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 7 files changed, 533 insertions(+), 432 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/icelakex/cache.json b/tools/perf/pmu-events/arch/x86/icelakex/cache.json
index 0cbb9d6a3ec1..e8ab6ef2cd50 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/cache.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/cache.json
@@ -75,14 +75,23 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0xF2",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xf2",
+        "EventName": "L2_LINES_OUT.USELESS_HWPF",
+        "PublicDescription": "Counts the number of cache lines that have been prefetched by the L2 hardware prefetcher but not used by demand access when evicted from the L2 cache",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4"
+    },
     {
         "BriefDescription": "L2 code requests",
         "Counter": "0,1,2,3",
@@ -224,7 +233,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. 
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -235,7 +243,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -246,7 +253,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -257,7 +263,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -268,7 +273,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -279,7 +283,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -290,7 +293,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -301,7 +303,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -312,7 +313,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -324,7 +324,6 @@ "Deprecated": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT", - "PEBS": "1", "SampleAfterValue": "20011", "UMask": "0x2" }, @@ -335,7 +334,6 @@ "Deprecated": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", - "PEBS": "1", "SampleAfterValue": "20011", "UMask": "0x4" }, @@ -345,7 +343,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -356,7 +353,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -367,7 +363,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -378,7 +373,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": 
"MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -389,7 +383,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x2" }, @@ -399,7 +392,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -410,7 +402,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was remote HITM.", "SampleAfterValue": "100007", "UMask": "0x4" @@ -421,7 +412,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with remote= Intel(R) Optane(TM) DC persistent memory as the data source and the data r= equest missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100007", "UMask": "0x10" @@ -432,7 +422,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -443,7 +432,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -454,7 +442,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. 
This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -465,7 +452,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -476,7 +462,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -487,7 +472,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -498,7 +482,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -509,7 +492,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -520,7 +502,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with local = Intel(R) Optane(TM) DC persistent memory as the data source and the data re= quest missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100003", "UMask": "0x80" diff --git a/tools/perf/pmu-events/arch/x86/icelakex/frontend.json b/tools/= perf/pmu-events/arch/x86/icelakex/frontend.json index d79ddc15b220..3b1bdcfaa276 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/frontend.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/frontend.json @@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. 
Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/too= ls/perf/pmu-events/arch/x86/icelakex/icx-metrics.json index db5510ba9099..7bee03e532e4 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json @@ -34,7 +34,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -55,49 +55,55 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. 
This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket.= ", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_PCIRDCUR + UNC_CHA_TOR_= INSERTS.IO_MISS_PCIRDCUR) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", "MetricName": "io_bandwidth_read_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket.", + 
"BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" @@ -106,14 +112,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -164,6 +170,12 @@ "MetricName": "llc_code_read_mpi_demand_plus_prefetch", "ScaleUnit": "1per_instr" }, + { + "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_= TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETI= RED.ANY", + "MetricName": "llc_data_read_mpi_demand_plus_prefetch", + "ScaleUnit": "1per_instr" + }, { "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) in nano seconds", "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_= OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", @@ -182,6 +194,12 @@ "MetricName": "llc_demand_data_read_miss_latency_for_remote_reques= ts", "ScaleUnit": "1ns" }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to DRAM in nano seconds= ", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", + "MetricName": "llc_demand_data_read_miss_to_dram_latency", + "ScaleUnit": "1ns" + }, { "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to Intel(R) Optane(TM) = Persistent Memory(PMEM) in nano seconds", "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", @@ -189,25 +207,25 @@ "ScaleUnit": "1ns" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", 
"MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -237,19 +255,19 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= ).", + "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= )", "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + U= NC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time", "MetricName": "memory_extra_write_bw_due_to_directory_updates", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -317,12 +335,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -334,7 +352,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -357,11 +375,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 *= tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts *= tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_= nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_E= VENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_di= vider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_= sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_as= sists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions",
         "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_branch_instructions",
@@ -375,7 +486,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -383,8 +494,8 @@
         "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -392,8 +503,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -401,18 +512,66 @@
         "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 43.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((48 * tma_info_system_core_frequency - 4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_frequency - 4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -422,25 +581,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "43.5 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(47.5 * tma_info_system_core_frequency - 4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -449,18 +608,18 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound - tma_pmm_bound if #has_pmem > 0 else CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound)",
+        "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -469,7 +628,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -477,44 +636,44 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
-        "MetricExpr": "48 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricExpr": "(120 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 48 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
     {
@@ -524,7 +683,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -534,16 +693,16 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -552,7 +711,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -561,7 +720,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -569,7 +736,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -578,7 +745,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -587,8 +754,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -596,8 +763,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -605,7 +772,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -617,17 +784,17 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
-        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=0x1@) / IDQ.MITE_UOPS",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
@@ -635,41 +802,41 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -683,7 +850,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
     },
@@ -698,8 +865,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
@@ -707,7 +874,7 @@
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -716,108 +883,10 @@
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)= ) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full= / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq= _full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_f= b_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latenc= y + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) = + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l= 2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_la= tency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma= _memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_= bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_laten= cy / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_lat= ency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tm= a_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_s= tore_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_l= atency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_st= ore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma= _false_sharing + tma_split_stores + tma_store_latency + tma_streaming_store= s)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) *= tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + t= ma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sha= ring) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bou= nd + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / = (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency = + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tm= a_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -878,11 +947,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -895,20 +964,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -938,7 +1007,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -956,7 +1031,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -964,7 +1039,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -972,7 +1047,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -980,7 +1055,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -988,7 +1063,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -996,7 +1071,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1041,7 +1116,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1051,7 +1126,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1098,7 +1173,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1116,7 +1191,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1152,13 +1227,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1177,21 +1252,15 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": 
"tma_info_memory_latency_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", @@ -1217,6 +1286,13 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", @@ -1244,7 +1320,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1265,18 +1341,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1294,28 +1370,28 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1325,13 +1401,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1347,45 +1424,46 @@ "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. 
Accounts for demand loads and L1/L2 data= -read prefetches" }, + { + "BriefDescription": "Fraction of Uncore cycles where requests got = rejected due to duplicate address already in IRQ ingress queue in the cache= homing agent", + "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", + "MetricGroup": "LockCont;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_mem_irq_duplicate_address", + "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, - { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", - "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC= _CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / cha_0@event\\=3D0x0@ if #has_pmem > 0 e= lse 0)", - "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", - "MetricName": "tma_info_system_mem_pmm_read_latency", - "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L= 2 data-read prefetches" - }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_read_bw" + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_write_bw" + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1393,7 +1471,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1401,7 +1479,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1415,6 +1493,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1423,12 +1508,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1437,14 +1522,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1454,13 +1540,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1476,33 +1562,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. 
Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1511,8 +1605,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "4 * tma_info_system_core_frequency * MEM_LOAD_RETIR= ED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / = tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1521,17 +1624,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "19 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", + "MetricExpr": "(23 * tma_info_system_core_frequency - 4 * tma_info= _system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1539,18 +1642,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1567,7 +1670,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1575,15 +1678,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + 
"MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "43.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(66.5 * tma_info_system_core_frequency - 23 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1591,9 +1718,9 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1604,16 +1731,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1621,8 +1748,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1632,11 +1759,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1650,7 +1777,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1658,8 +1785,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1673,19 +1800,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1693,8 +1828,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1702,7 +1837,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1717,28 +1852,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM = * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOA= D_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LO= AD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD= _RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMO= TE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOA= D_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE= _PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCL= E_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L= 1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bo= und) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL= _PMM) > MEM_LOAD_RETIRED.L1_MISS else 0) if #has_pmem > 0 else 0)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= 
(Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", "ScaleUnit": "100%" }, { @@ -1774,7 +1900,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1782,8 +1908,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1791,8 +1917,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise).
Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1800,7 +1926,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1809,7 +1935,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1818,32 +1944,32 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "(97 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_HITM + 97 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((120 * tma_info_system_core_frequency - 23 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (120 * = tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "108 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(131 * tma_info_system_core_frequency - 23 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. 
Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1856,7 +1982,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1865,7 +1991,7 @@ "MetricExpr": "37 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1874,8 +2000,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1884,17 +2010,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1902,8 +2028,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1912,17 +2038,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1939,7 +2065,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1947,7 +2073,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1955,7 +2105,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - 
"MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1964,7 +2114,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1973,8 +2123,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/icelakex/memory.json b/tools/pe= rf/pmu-events/arch/x86/icelakex/memory.json index 32a3dedb82fb..ec9577cce3ac 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/memory.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/memory.json @@ -25,7 +25,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -38,7 +37,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. 
Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -51,7 +49,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -64,7 +61,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -77,7 +73,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -90,7 +85,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -103,7 +97,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -116,7 +109,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -353,17 +345,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. 
interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/icelakex/metricgroups.json index cccfcab3425e..4a28187c2896 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json @@ -38,6 +38,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -84,7 +85,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -94,6 +97,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", "tma_issueBW": "Metrics related by the issue $issueBW", @@ -113,10 +117,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +136,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - 
"tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json b/tools/= perf/pmu-events/arch/x86/icelakex/pipeline.json index 74285b6c81e7..f1446f1b67c6 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json @@ -9,6 +9,15 @@ "SampleAfterValue": "1000003", "UMask": "0x9" }, + { + "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0x14", + "EventName": "ARITH.FP_DIVIDER_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", "Counter": "0,1,2,3,4,5,6,7", @@ -23,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -32,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -42,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -52,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. 
A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "50021", "UMask": "0x11" @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "50021", "UMask": "0x10" @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch in= structions retired (excluding RETs. TSX aborts is considered indirect branc= h).", "SampleAfterValue": "50021", "UMask": "0x80" @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) calls, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "50021", "UMask": "0x8" @@ -377,7 +369,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -387,7 +378,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. 
INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003"
     },
@@ -396,7 +386,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.NOP",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     },
@@ -404,7 +393,6 @@
         "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.PREC_DIST",
-        "PEBS": "1",
         "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 5a20439450ab..4472efb5bf9f 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -16,7 +16,7 @@ GenuineIntel-6-A[DE],v1.04,graniterapids,core
 GenuineIntel-6-(3C|45|46),v36,haswell,core
 GenuineIntel-6-3F,v29,haswellx,core
 GenuineIntel-6-7[DE],v1.24,icelake,core
-GenuineIntel-6-6[AC],v1.26,icelakex,core
+GenuineIntel-6-6[AC],v1.27,icelakex,core
 GenuineIntel-6-3A,v24,ivybridge,core
 GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
-- 
2.47.1.545.g3c1d2e2a6a-goog
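For context on the mapfile.csv hunk above: the first column is a regular
expression that is matched against the CPU's CPUID string to select the
event tables, so the hunk only bumps the event file version the icelakex
directory claims, from v1.26 to v1.27. A rough Python sketch of that lookup
(illustrative only; perf does this in C via generated pmu-events tables, and
the exact CPUID string format is an assumption here):

    import csv
    import re

    def events_dir_for(mapfile_path: str, cpuid: str):
        # Each row: Family-model regex, event file version, directory, type.
        with open(mapfile_path, newline="") as f:
            for row in csv.reader(f):
                if not row or row[0].startswith("Family-model"):
                    continue  # skip the header line
                regex, version, directory, event_type = row[:4]
                if re.fullmatch(regex, cpuid):
                    return directory, version
        return None

    # "GenuineIntel-6-6A" matches GenuineIntel-6-6[AC] -> ("icelakex", "v1.27")
    print(events_dir_for("tools/perf/pmu-events/arch/x86/mapfile.csv",
                         "GenuineIntel-6-6A"))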
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:52 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-16-irogers@google.com>
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 15/22] perf vendor events: Update/add Lunarlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.01 to v1.09. Add TMA metrics 5.01.
Bring in the event updates v1.09:
https://github.com/intel/perfmon/commit/cbc3b0dc19e8fc52c9604f1da301648ed69f012b
https://github.com/intel/perfmon/commit/28f4b24f9152a0ee1fb3435535628384ad881c22
https://github.com/intel/perfmon/commit/172900e962fdd34ddb80879f4f91add5f773ca29
https://github.com/intel/perfmon/commit/dab0308f7a27d2c644e08d63436b790a207fb22e

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../pmu-events/arch/x86/lunarlake/cache.json  | 1322 ++++-
 .../arch/x86/lunarlake/floating-point.json    |  484 ++
 .../arch/x86/lunarlake/frontend.json          |  654 ++-
 .../arch/x86/lunarlake/lnl-metrics.json       | 4611 +++++++++++++++++
 .../pmu-events/arch/x86/lunarlake/memory.json |  262 +-
 .../arch/x86/lunarlake/metricgroups.json      |  150 +
 .../pmu-events/arch/x86/lunarlake/other.json  |  496 +-
 .../arch/x86/lunarlake/pipeline.json          | 2104 +++++++-
 .../arch/x86/lunarlake/uncore-memory.json     |   18 +
 .../arch/x86/lunarlake/virtual-memory.json    |  428 ++
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 11 files changed, 10218 insertions(+), 313 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/metricgroups.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json

diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json b/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
index 759714618e08..9bad6a354b2b 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
@@ -1,4 +1,255 @@
 [
+    {
+        "BriefDescription": "Counts the number of request that were not accepted into the L2Q because the L2Q is FULL.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x31",
+        "EventName": "CORE_REJECT_L2Q.ANY",
+        "PublicDescription": "Counts the number of (demand and L1 prefetchers) core requests rejected by the L2Q due to a full or nearly full w condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a cores dirty eviction when the address conflicts incoming external snoops. (Note that L2 prefetcher requests that are dropped are not counted by this event.)",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of L1D cacheline (dirty) evictions caused by load misses, stores, and prefetches.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x51",
+        "EventName": "DL1.DIRTY_EVICTION",
+        "PublicDescription": "Counts the number of L1D cacheline (dirty) evictions caused by load misses, stores, and prefetches. Does not count evictions or dirty writebacks caused by snoops.
Does not count a replacement = unless a (dirty) line was written back.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines replaced in = L0 data cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.L0_REPLACEMENT", + "PublicDescription": "Counts L0 data line replacements including o= pportunistic replacements, and replacements that require stall-for-replace = or block-for-replace.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cachelines replaced into the L0 and L1 d-cach= e. Successful replacements only (not blocked) and exclude WB-miss case", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.REPLACEMENT", + "PublicDescription": "Counts cachelines replaced into the L0 and L= 1 d-cache.", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D Fill Buffer (FB) unavailability.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.FB_FULL", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D Fill Buffer (FB) unavailability. Demand requests include= cacheable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D due to lack of L2 resources.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.L2_STALLS", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D due to lack of L2 resources. Demand requests include cac= heable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of demand requests that missed L1D cac= he", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.LOAD", + "PublicDescription": "Count occurrences (rising-edge) of DCACHE_PE= NDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increas= es* (at FB alloc or promotion).", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of L1D misses that are outstanding", + "Counter": "2", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD", + "PublicDescription": "Counts number of L1D misses that are outstan= ding in each cycle, that is each cycle the number of Fill Buffers (FB) outs= tanding required by Demand Reads. FB either is held by demand loads, or it = is held by non-demand loads and gets hit at least once by demand. The valid= outstanding interval is defined until the FB deallocation by one of the fo= llowing ways: from FB allocation, if FB is allocated by demand from the dem= and Hit FB, if it is allocated by hardware or software prefetch. 
Note: In t= he L1D, a Demand Read contains cacheable or noncacheable demand loads, incl= uding ones causing cache-line splits and reads due to page walks resulted f= rom any request type.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with L1D load Misses outstanding.", + "Counter": "2", + "CounterMask": "1", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD_CYCLES", + "PublicDescription": "Counts duration of L1D miss outstanding in c= ycles.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache lines filling L2", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.ALL", + "PublicDescription": "Counts the number of L2 cache lines filling = the L2. Counting does not cover rejects.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Exclusive state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.E", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Forward state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.F", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Invalid state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.I", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Invalid state, does not count lines that go Inval= id due to an eviction", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Modified state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.M", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Shared state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.S", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = evicted due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= evicted due to an L2 cache fill. Increments on the core that brought the l= ine in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Modified cache lines that are evicted by L2 c= ache when triggered by an L2 cache fill.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of lines that are evicted = by L2 cache when triggered by an L2 cache fill. Those lines are in Modified= state. 
Modified lines are written back to L3", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = silently dropped due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= silently dropped due to an L2 cache fill. Increments on the core that bro= ught the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that have= been L2 hardware prefetched but not used by demand accesses", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of L2 cache lines that hav= e been L2 hardware prefetched but not used by demand accesses. Increments = on the core that brought the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cache lines that have been L2 hardware prefet= ched but not used by demand accesses", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of cache lines that have b= een prefetched by the L2 hardware prefetcher but not used by demand access = when evicted from the L2 cache", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by either the L2 Stream or AMP that were throttled due to Dynamic Prefetch= Throttling. The throttle requestor/source could be from the uncore/SOC or = the Dead Block Predictor. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DPT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by the L2 Stream that were throttled due to Demand Throttle Prefetcher. DT= P Global Triggered with no Local Override. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DTP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by the L2 Stream and not throttled by DTP due to local override. These pre= fetches may still be throttled due to another throttler mechanism besides D= TP. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DTP_OVERRIDE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by either the L2 Stream or AMP that were throttled due to exceeding the XQ = threshold set by either XQ_THRESOLD_DTP or XQ_THRESHOLD. 
Counts on a per co= re basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.XQ_THRESH", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of demand and prefetch tran= sactions that the External Queue (XQ) rejects due to a full or near full co= ndition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x30", + "EventName": "L2_REJECT_XQ.ANY", + "PublicDescription": "Counts the number of demand and prefetch tra= nsactions that the External Queue (XQ) rejects due to a full or near full c= ondition which likely indicates back pressure from the IDI link. The XQ ma= y reject transactions from the L2Q (non-cacheable requests), BBL (L2 misses= ) and WOB (L2 write-back victims).", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of L2 Cache Accesses Counts= the total number of L2 Cache Accesses - sum of hits, misses, rejects fron= t door requests for CRd/DRd/RFO/ItoM/L2 Prefetches only, per core event", "Counter": "0,1,2,3,4,5,6,7", @@ -9,69 +260,672 @@ "UMask": "0x7", "Unit": "cpu_atom" }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.ALL", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that r= esulted in a Hit from a front door request only (does not include rejects o= r recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of total L2 Cache Accesses = that resulted in a Miss from a front door request only (does not include re= jects or recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_RQSTS.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_RQSTS.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that m= iss the L2 and get BBL reject short and long rejects (includes those count= ed in L2_reject_XQ.any), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.REJECTS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Demand Data Read access L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD", + "PublicDescription": "Counts Demand Data Read requests accessing t= he L2 cache. These requests may hit or miss L2 cache. 
True-miss exclude mis= ses that were merged with ongoing L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0xe1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ANY", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache misses when fetching instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.CODE_RD_MISS", + "PublicDescription": "Counts L2 cache misses when fetching instruc= tions.", + "SampleAfterValue": "200003", + "UMask": "0x24", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests that hit L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT", + "PublicDescription": "Counts the number of demand Data Read reques= ts initiated by load instructions that hit L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS", + "PublicDescription": "Counts demand Data Read requests with true-m= iss in the L2 cache. True-miss excludes misses that were merged with ongoin= g L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_REQUEST.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_REQUEST.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.REFERENCES", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "RFO requests that miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.RFO_MISS", + "PublicDescription": "Counts the RFO (Read-for-Ownership) requests= that miss L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x22", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when L1D is locked", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x42", + "EventName": "LOCK_CYCLES.CACHE_LOCK_DURATION", + "PublicDescription": "This event counts the number of cycles when = the L1D is locked.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. 
Counts on a per core basis.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x2e", "EventName": "LONGEST_LAT_CACHE.MISS", - "PublicDescription": "Counts the number of cacheable memory reques= ts that miss in the Last Level Cache (LLC). Requests include demand loads, = reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the= platform has an L3 cache, the LLC is the L3 cache, otherwise it is the L2 = cache. Counts on a per core basis.", + "PublicDescription": "Counts the number of cacheable memory reques= ts that miss in the Last Level Cache (LLC). Requests include demand loads, = reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the= core has access to an L3 cache, the LLC is the L3 cache, otherwise it is t= he L2 cache. Counts on a per core basis.", "SampleAfterValue": "200003", "UMask": "0x41", "Unit": "cpu_atom" }, { - "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.MISS", - "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", - "SampleAfterValue": "100003", - "UMask": "0x41", - "Unit": "cpu_core" + "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = core has access to an L3 cache, the LLC is the L3 cache, otherwise it is th= e L2 cache. Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x4f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. 
It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x4f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stal= led due to an instruction cache or Translation Lookaside Buffer (TLB) miss = which hit in the L2 cache. Includes L2 Hit resulting from and L1D eviction= of another core in the same module which is longer latency than a typical = L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an L1 demand load miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalle= d due to a demand load which hit in the L2 cache. 
Includes L2 Hit resultin= g from and L1D eviction of another core in the same module which is longer = latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled to a store buffer full condition", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.SBFULL", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts all retired load instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_LOADS", + "PublicDescription": "Counts Instructions with at least one archit= ecturally visible load retired.", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_STORES", + "PublicDescription": "Counts all retired store instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired software prefetch instructions.", + "Counter": "0,1,2,3", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_SWPF", + "PublicDescription": "Counts all retired software prefetch instruc= tions.", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All retired memory instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ANY", + "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", + "SampleAfterValue": "1000003", + "UMask": "0x87", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with locked access.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.LOCK_LOADS", + "PublicDescription": "Counts retired load instructions with locked= access.", + "SampleAfterValue": "100007", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that split across a= cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that split across = a cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_STORES", + "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", + 
"SampleAfterValue": "100003", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that hit the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", + "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that hit the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", + "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0xa", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that miss the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", + "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x11", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that miss the STLB= .", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", + "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x12", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were a cross-core Snoop hits and forwards data from an in on-package core c= ache (induced by NI$)", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were a cross-core Snoop hits and forwards data from an in on-packag= e core cache (induced by NI$)", + "SampleAfterValue": "20011", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were HitM responses from shared L3, Hit-with-FWD is normally excluded.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", + "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3, Hit-with-FWD is normally exclud= ed.", + "SampleAfterValue": "20011", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were L3 hit and cross-core snoop missed in on-pkg core cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were L3 and cross-core snoop hits in on-pkg core cache", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x2", + "Unit": 
"cpu_core" + }, + { + "BriefDescription": "Retired load instructions which data source i= s memory side cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_L3_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "100007", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions with at least 1 uncachea= ble load or lock.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd4", + "EventName": "MEM_LOAD_MISC_RETIRED.UC", + "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", + "SampleAfterValue": "100007", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of completed demand load requests that= missed the L1, but hit the FB(fill buffer), because a preceding miss to th= e same cacheline initiated the line to be brought into L1, but data is not = yet ready in L1.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.FB_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L1 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts retired load instructions with at leas= t one uop that hit in the Level 1 of the L1 data cache.", + "Counter": "0,1,2,3", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT_L1", + "SampleAfterValue": "1000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L1 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L2 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_HIT", + "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L2 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_MISS", + "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", + "SampleAfterValue": "100021", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L3 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 
cache.", + "SampleAfterValue": "100021", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L3 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", + "SampleAfterValue": "50021", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_LOAD_UOPS_LLC_MISS_RETIRED.MEMSIDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_L3_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss the L2 cache, missed the Memory Side Cache and hit in DRAM", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss the LLC cache and hit in the Memory Side Cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_LLC_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "PublicDescription": "Counts the number of load ops retired that h= it in the L2 cache. 
Includes L2 Hit resulting from and L1D eviction of ano= ther core in the same module which is longer latency than a typical L2 hit.= ", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to load buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to RSV full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to store buffer full", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.REFERENCE", - "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = platform has an L3 cache, the LLC is the L3 cache, otherwise it is the L2 c= ache. Counts on a per core basis.", - "SampleAfterValue": "200003", - "UMask": "0x4f", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "BriefDescription": "MEM_STORE_RETIRED.L2_HIT", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.REFERENCE", - "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. 
It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", - "SampleAfterValue": "100003", - "UMask": "0x4f", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Retired load instructions.", - "Counter": "0,1,2,3", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", - "PublicDescription": "Counts all retired load instructions. This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", - "SampleAfterValue": "1000003", - "UMask": "0x81", + "BriefDescription": "Number of cache-lines required by retired sto= res whose Data Source is: Memory Side Cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "100021", "Unit": "cpu_core" }, { - "BriefDescription": "Retired store instructions.", - "Counter": "0,1,2,3", + "BriefDescription": "Counts the number of memory uops retired. A = single uop that performs both a load AND a store will be counted as 1, not = 2 (e.g. ADD [mem], CONST)", + "Counter": "0,1,2,3,4,5,6,7", "Data_LA": "1", "EventCode": "0xd0", - "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", - "PublicDescription": "Counts all retired store instructions.", - "SampleAfterValue": "1000003", - "UMask": "0x82", - "Unit": "cpu_core" + "EventName": "MEM_UOPS_RETIRED.ALL", + "SampleAfterValue": "200003", + "UMask": "0x83", + "Unit": "cpu_atom" }, { "BriefDescription": "Counts the number of load uops retired.", @@ -79,7 +933,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x81", "Unit": "cpu_atom" @@ -90,24 +943,10 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_STORES", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x82", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", - "Counter": "0,1,2,3,4,5,6,7", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_1024", - "MSRIndex": "0x3F6", - "MSRValue": "0x400", - "PEBS": "2", - "SampleAfterValue": "200003", - "UMask": "0x5", - "Unit": "cpu_atom" - }, { "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", "Counter": "0,1,2,3,4,5,6,7", @@ -116,7 +955,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -129,20 +967,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", - "SampleAfterValue": "200003", - "UMask": "0x5", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", - "Counter": "0,1,2,3,4,5,6,7", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_2048", - "MSRIndex": "0x3F6", - "MSRValue": "0x800", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -155,7 +979,6 @@ 
"EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -168,7 +991,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -181,7 +1003,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -194,7 +1015,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -207,7 +1027,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -220,20 +1039,373 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of load uops retired that p= erformed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory renamed load uops= retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.MRN_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory renamed store uop= s retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.MRN_STORES", + "SampleAfterValue": "200003", + "UMask": "0xa", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split load uops.= ", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split store uops= .", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that fi= lled the STLB - includes those in DTLB_LOAD_MISSES submasks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store ops retired (store= 
STLB miss)", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", "Counter": "0,1,2,3,4,5,6,7", "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x6", "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired memory uops for any access", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe5", + "EventName": "MEM_UOP_RETIRED.ANY", + "PublicDescription": "Number of retired micro-operations (uops) fo= r load or store memory accesses", + "SampleAfterValue": "1000003", + "UMask": "0xf", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by mem side cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.MEMSIDE_CACHE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11F80000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches which forwarded th= e unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x20001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y mem side cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.MEMSIDE_CACHE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11F80000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Any memory transaction that reached the SQ.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS", + "PublicDescription": "Counts memory transactions reached the super= queue including requests initiated by the core, all L3 prefetches, page wa= lks, etc..", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand and prefetch data reads", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DATA_RD", + 
"PublicDescription": "Counts the demand and prefetch data reads. A= ll Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 p= refetchers). Counting also covers reads due to page walks resulted from any= request type.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cacheable and Non-Cacheable code read request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", + "PublicDescription": "Counts both cacheable and Non-Cacheable code= read requests.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests sent to uncore", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD", + "PublicDescription": "Counts the Demand Data Read requests sent to= uncore. Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determi= ne average latency in the uncore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand RFO requests including regular RFOs, l= ocks, ItoM", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_RFO", + "PublicDescription": "Counts the demand RFO (read for ownership) r= equests including regular RFOs, locks, ItoM.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when offcore outstanding cacheable Cor= e Data Read transactions are present in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "PublicDescription": "Counts cycles when offcore outstanding cache= able Core Data Read transactions are present in the super queue. A transact= ion is considered to be in the Offcore outstanding state between L2 miss an= d transaction completion sent to requestor (SQ de-allocation). See correspo= nding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding Code Reads tr= ansactions in the SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). 
See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 outstanding demand da= ta read request is pending.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA= _RD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding demand rfo re= ads transactions in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO", + "PublicDescription": "Counts the number of offcore outstanding dem= and rfo Reads transactions in the super queue every cycle. The 'Offcore out= standing' state of the transaction lasts from the L2 miss until the sending= transaction completion to requestor (SQ deallocation). See the correspondi= ng Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding Code Reads transactions i= n the SuperQueue (SQ), queue to uncore, every cycle.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = outstanding demand data read requests pending.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= outstanding demand data read requests pending. Requests are considered o= utstanding from the time they miss the core's L2 cache until the transactio= n completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Store Read transactions pending for off-core.= Highly correlated.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO", + "PublicDescription": "Counts the number of off-core outstanding re= ad-for-ownership (RFO) store transactions every cycle. An RFO transaction i= s considered to be in the Off-core outstanding state between L2 cache miss = and transaction completion.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts bus locks, accounts for cache line spl= it locks and UC locks.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to= enforce cache coherency for certain memory accesses that need to be done a= tomically. 
Can be created by issuing an atomic instruction (via the LOCK p= refix) which causes a cache line split or accesses uncacheable memory.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of PREFETCHNTA, PREFETCHW, = PREFETCHT0, PREFETCHT1 or PREFETCHT2 instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.ANY", + "SampleAfterValue": "100003", + "UMask": "0xf", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHNTA instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.NTA", + "PublicDescription": "Counts the number of PREFETCHNTA instruction= s executed.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHW instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.PREFETCHW", + "PublicDescription": "Counts the number of PREFETCHW instructions = executed.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHT0 instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.T0", + "PublicDescription": "Counts the number of PREFETCHT0 instructions= executed.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHT1 or PREFETCHT2 instructio= ns executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.T1_T2", + "PublicDescription": "Counts the number of PREFETCHT1 or PREFETCHT= 2 instructions executed.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to an icache miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json b= /tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json new file mode 100644 index 000000000000..26eceb52a437 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json @@ -0,0 +1,484 @@ +[ + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when floating-point divide unit is bus= y executing divide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.FPDIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. 
Accounts for floating-point operatio= ns only.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of active floating point di= viders per cycle in the loop stage.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point divider u= ops executed per cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts all microcode FP assists.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.FP", + "PublicDescription": "Counts all microcode Floating Point assists.= ", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ASSISTS.SSE_AVX_MIX", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.SSE_AVX_MIX", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 1st VEC= port (port 0). FP-arith-uops are of type ADD*/SUB*/MUL/FMA*/DPP.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V0", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 2nd VEC= port (port 1)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V1", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 3rd VEC= port (port 5)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V2", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 4th VEC= port", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V3", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.4_FLOPS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.VECTOR", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "SampleAfterValue": "1000003", + "UMask": "0x3c", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_128B [This event= is alias to FP_ARITH_OPS_RETIRED.VECTOR_128B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR_128B", + "SampleAfterValue": "100003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_256B [This event= is alias to FP_ARITH_OPS_RETIRED.VECTOR_256B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR_256B", + "SampleAfterValue": "100003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 128-bi= t packed double precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 2 computation= operations, one for each element. Applies to SSE* and AVX* packed double = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice = as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed double precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 2 computation opera= tions, one for each element. Applies to SSE* and AVX* packed double precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT DPP FM(N)ADD/SUB. 
DPP and FM(N)ADD/SUB instructions count twice as the= y perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR re= gister need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packe= d single precision floating-point instructions retired; some instructions w= ill count twice as noted below. Each count represents 4 computation operat= ions, one for each element. Applies to SSE* and AVX* packed single precisi= on floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT = DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they pe= rform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed single precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 4 computation opera= tions, one for each element. Applies to SSE* and AVX* packed single precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count tw= ice as they perform 2 calculations per element. The DAZ and FTZ flags in th= e MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 256-bi= t packed double precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 4 computation= operations, one for each element. Applies to SSE* and AVX* packed double = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perf= orm 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational 256-bit pack= ed double precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 4 computation opera= tions, one for each element. Applies to SSE* and AVX* packed double precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 = calculations per element. The DAZ and FTZ flags in the MXCSR register need = to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 256-bi= t packed single precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 8 computation= operations, one for each element. Applies to SSE* and AVX* packed single = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. 
+ "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE", + "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_SINGLE", + "PublicDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Number of any Vector retired FP arithmetic instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic instructions. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3c", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_128B [This event is alias to FP_ARITH_INST_RETIRED.VECTOR_128B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_128B", + "SampleAfterValue": "100003", + "UMask": "0xc", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_256B [This event is alias to FP_ARITH_INST_RETIRED.VECTOR_256B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_256B", + "SampleAfterValue": "100003", + "UMask": "0x30", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of all types of floating point operations per uop with all default weighting", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of floating point operations that produce 32 bit single precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP32", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of floating point operations that produce 64 bit double precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP64", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit double precision floating point. This may be SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit single precision floating point. This may be SSE or AVX.128 operations.",
+ "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 32bit single precision floating point", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.32B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 64 bit double precision floating point", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.64B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the total number of floating point retired instructions.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3f", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on all floating point ports.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x1f", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 0.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.P0", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 1.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.P1", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 2.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.P2", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.P3", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 0, 1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x1e", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of uops executed on floating point and vector integer store data port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.STD", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of floating point operations retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operations retired that required microcode assist, which is not a reflection of the number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of floating point divide uops retired (x87 and sse, including x87 sqrt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_atom" + } +]
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json b/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json index 0327bece0f94..07bd38a1904e --- a/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json @@ -1,4 +1,446 @@ [
+ { + "BriefDescription": "Counts the total number of BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which occur when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend. Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Clears due to Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered when it finds a branch instruction in a fetch line. This is called Unknown Branch which occurs for the first time a branch instruction is fetched or when the branch is not tracked by the BPU (Branch Prediction Unit) anymore.",
+ "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of BACLEARS due to a conditional jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.COND", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of BACLEARS due to an indirect branch.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of BACLEARS due to a return branch.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.RETURN", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of BACLEARS due to a direct, unconditional jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.UNCOND", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Stalls caused by changing prefix length of the instruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.LCP", + "PublicDescription": "Counts cycles that the Instruction Length decoder (ILD) stalls occurred due to dynamically changing prefix length of the decoded instruction (by operand size prefix instruction 0x66, address size prefix instruction 0x67 or REX.W for Intel64). Count is proportional to the number of prefixes in a 16B-line. This may result in a three-cycle penalty for each LCP (Length changing prefix) in a 16-byte chunk.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Cycles the Microcode Sequencer is busy.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.MS_BUSY", + "SampleAfterValue": "500009", + "UMask": "0x2", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of times a decode restriction reduces the decode throughput due to wrong instruction length prediction.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe9", + "EventName": "DECODE_RESTRICTION.PREDECODE_WRONG", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "DSB-to-MITE switch true penalty cycles.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x61", + "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", + "PublicDescription": "Decode Stream Buffer (DSB) is a Uop-cache that holds translations of previously fetched instructions that were decoded by the legacy x86 decode pipeline (MITE). This event counts fetch penalty cycles when a transition occurs from DSB to MITE.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged with having preceded with frontend bound behavior", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ALL", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "Always Not Taken (ANT) conditional retired branches (no BTB entry and not mispredicted)", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired Instructions who experienced DSB miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x1", + "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of instruction retired that are tagged after a branch instruction causes bubbles/empty issue slots due to a baclear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instruction retired that are tagged after a branch instruction causes bubbles /empty issue slots due to a btclear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged following an ms flow due to the bubble/wasted issue slot from exiting long ms flow", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.CISC", + "PublicDescription": "Counts the number of instructions retired that were tagged following an ms flow due to the bubble/wasted issue slot from exiting long ms flow", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged every cycle the decoder is unable to send 4 uops", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Retired Instructions who experienced a critical DSB miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x11", + "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to icache miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Retired Instructions who experienced iTLB true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x14", + "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired Instructions who experienced Instruction L1 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L1I_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x12", + "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired Instructions who experienced Instruction L2 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L2_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x13", + "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", + "MSRIndex": "0x3F7", + "MSRValue": "0x608006", + "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 16 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", + "MSRIndex": "0x3F7", + "MSRValue": "0x601006", + "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
+ "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions after front-end starvation of at least 2 cycles", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", + "MSRIndex": "0x3F7", + "MSRValue": "0x600206", + "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", + "MSRIndex": "0x3F7", + "MSRValue": "0x610006", + "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end had at least 1 bubble-slot for a period of 2 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", + "MSRIndex": "0x3F7", + "MSRValue": "0x100206", + "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 32 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", + "MSRIndex": "0x3F7", + "MSRValue": "0x602006", + "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", + "MSRIndex": "0x3F7", + "MSRValue": "0x600406", + "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", + "MSRIndex": "0x3F7", + "MSRValue": "0x620006", + "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", + "MSRIndex": "0x3F7", + "MSRValue": "0x604006", + "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 8 cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", + "MSRIndex": "0x3F7", + "MSRValue": "0x600806", + "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Mispredicted Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MISP_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "ANT retired branches that got just mispredicted", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts flows delivered by the Microcode Sequencer", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MS_FLOWS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of instruction retired tagged after a wasted issue slot if none of the previous events occurred", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instruction retired that are tagged after a branch instruction causes bubbles/empty issue slots due to a predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Retired Instructions who experienced STLB (2nd level TLB) true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.STLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x15", + "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Retired instructions that caused clears due to being Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", + "MSRIndex": "0x3F7", + "MSRValue": "0x17", + "PublicDescription": "Number retired branch instructions that caused the front-end to be resteered when it finds the instruction in a fetch line. This is called Unknown Branch which occurs for the first time a branch instruction is fetched or when the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_HIT", + "PublicDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that hit in the L2 cache. Includes L2 Hit resulting from an L1D eviction of another core in the same module which is longer latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that hit in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + },
+ { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that also missed the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + },
{ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump.", "Counter": "0,1,2,3,4,5,6,7", @@ -8,6 +450,15 @@ "UMask": "0x3", "Unit": "cpu_atom" },
+ { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump and the instruction cache registers bytes are present.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + },
{ "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequential from the previous line or being redirected by a jump and the instruction cache registers bytes are not present. -", "Counter": "0,1,2,3,4,5,6,7", @@ -18,13 +469,212 @@ "Unit": "cpu_atom" },
{ - "BriefDescription": "This event counts a subset of the Topdown Slots event that were no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations.", + "BriefDescription": "Cycles where a code fetch is stalled due to L1 instruction cache miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALLS", + "PublicDescription": "Counts cycles where a code line fetch is stalled due to an L1 instruction cache miss. The decode pipeline works at a 32
The decode pipeline works at a 32= Byte granularity.", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ICACHE_DATA.STALL_PERIODS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALL_PERIODS", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to L= 1 instruction cache tag miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x83", + "EventName": "ICACHE_TAG.STALLS", + "PublicDescription": "Counts cycles where a code fetch is stalled = due to L1 instruction cache tag miss.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Decode Stream Buffer (DSB) is deliveri= ng any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) p= ath.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles DSB is delivering optimal number of Uo= ps", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the D= SB (Decode Stream Buffer) path. Count includes uops that may 'bypass' the I= DQ.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from the Decode Stream Buffer (DSB) path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.DSB_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipe= line) path. During these cycles uops are not being delivered from the Decod= e Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering optimal number of U= ops", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the M= ITE (legacy decode pipeline) path. During these cycles uops are not being d= elivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from MITE path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MITE_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the MITE path. 
This also means that uops are not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when uops are being delivered to IDQ while MS is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_CYCLES_ANY", + "PublicDescription": "Counts cycles during which uops are being delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Uops may be initiated by Decode Stream Buffer (DSB) or MITE.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of switches from DSB or MITE to the MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_SWITCHES", + "PublicDescription": "Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops initiated by MITE or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MS_UOPS", + "PublicDescription": "Counts the number of uops initiated by MITE or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts the subset of the Topdown Slots event in which no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations.", "Counter": "0,1,2,3,4,5,6,7,8,9", "EventCode": "0x9c", "EventName": "IDQ_BUBBLES.CORE", - "PublicDescription": "This event counts a subset of the Topdown Slots event that were no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations. Software can use this event as the numerator for the Frontend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.", + "PublicDescription": "This event counts the subset of the Topdown Slots event in which no operation was delivered to the back-end pipeline due to instruction fetch limitations when the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations. Software can use this event as the numerator for the Frontend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.", "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_core" + },
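As the IDQ_BUBBLES.CORE descriptions above note, software can use this event as the numerator for the Frontend Bound category of Top-down Microarchitecture Analysis. A minimal Python sketch of that ratio, assuming the raw counts for IDQ_BUBBLES.CORE and the Topdown Slots event have already been collected; the function name and sample values are illustrative only, not measurements:

def frontend_bound(idq_bubbles_core, topdown_slots):
    # Frontend Bound = issue slots lost to fetch limitations / all slots.
    return idq_bubbles_core / topdown_slots if topdown_slots else 0.0

# Made-up counts: 1.2e9 bubble slots out of 8e9 total slots -> 15.00%.
print(f"{frontend_bound(1_200_000_000, 8_000_000_000):.2%}")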
+ { + "BriefDescription": "This event is deprecated. [This event is alias to IDQ_BUBBLES.STARVATION_CYCLES]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "Deprecated": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when the optimal number of uops was delivered to the back-end when the back-end is not stalled", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_FE_WAS_OK", + "Invert": "1", + "PublicDescription": "Counts the number of cycles when the optimal number of uops was delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are delivered by the IDQ for 2 or more cycles when the backend of the machine is not stalled - normally indicating a Fetch Latency issue", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.FETCH_LATENCY", + "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls for 2 or more cycles - normally indicating a Fetch Latency issue.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are delivered by the IDQ when the backend of the machine is not stalled [This event is alias to IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.STARVATION_CYCLES", + "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. [This event is alias to IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE]", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-sequencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times entered into a ucode flow in the FEC. 
Includes inserted flows due to front-end detected faul= ts or assists.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_ENTRY", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times nanocode flow is e= xecuted.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.NANO_CODE", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json b/to= ols/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json new file mode 100644 index 000000000000..6782c4e054cd --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json @@ -0,0 +1,4611 @@ +[ + { + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": 
"TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_allocation_restriction", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. 
+ { + "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", + "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches is categorized under the Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;MemoryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. 
(A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + t= ma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bo= und + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + = tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)= ) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_cap= acity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bou= nd * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_= fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_la= tency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bou= nd / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + t= ma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_boun= d * (tma_store_bound / 
(tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tm= a_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)= ))", + "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += 
tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE= / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serial= izing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_m= icrocode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microc= ode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb= _full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_st= ores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tm= a_L1_group", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instr= uctions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. 
Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to BTCLEARS, which occurs when the Branch Target Buffer (BTB) predicts a taken branch.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. 
A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & 
tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_bwd_mispredicts", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_fwd_mispredicts", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * 
cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_L= OAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * t= ma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 <= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.= XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_cor= e_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= ) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(cpu@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ + = IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_= DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. 
For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers a suboptimal amount of uops to the Backend. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operations_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0x30@ = / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions, where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions, where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations, instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations, instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences. ([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + 
"ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipflop", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX 256-bit in= struction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.256B_DP@ + cpu_atom@FP_INST_RETIRED.256B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx256", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Ifetch", + "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", + "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_ato= m@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Load_Store_Miss", + "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", + "MetricGroup": "Mem_Exec", + "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. 
See Info.Mem_Exec_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number mea= ns higher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricName": "tma_info_br_inst_mix_ipcall", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch (Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are backward taken c=
onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual 
per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / cpu_atom@INST_RETI= RED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction 
(BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_l1_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_load_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", + "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_store_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory disambiguation", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu= _atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory ordering", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cp= u_atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory renaming", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@= INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", + "MetricExpr": "1e3 * 
cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@= MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", + "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", + "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_ipload", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", + "MetricName": "tma_info_mem_mix_ipstore", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_locks_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_splits_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of mem load uops to all uops", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL@", + "MetricName": "tma_info_mem_mix_memload_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or 
retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average 
per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
Per-Logical Proce= ssor)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Metric;Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Core_Metric;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / 
IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Metric;Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "Metric;MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", + "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricName": "tma_info_serialization_%_tpause_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait;Metric", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary;System_Metric", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Metric;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Metric;Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Flops", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;Inst_Metric;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "Metric;OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Clocks;Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC;System_Metric", + "MetricName": "tma_info_system_power", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Seconds;Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Count;Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Metric;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Metric;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Metric;Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Metric;Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Metric", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDO= WN_RETIRING.ALL@", + "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOW= N_RETIRING.ALL@", + "MetricName": "tma_info_uop_mix_idiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are microcode op= s", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_= RETIRING.ALL@", + "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN= _RETIRING.ALL@", + "MetricName": "tma_info_uop_mix_x87_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_ope= rations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. 
Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_= 2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", + "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD= _RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_L= OAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thr= ead_clks", + "MetricGroup": "BvML;Clocks_Retired;MemoryLat;TopdownL4;tma_L4_gro= up;tma_l1_bound_group", + "MetricName": "tma_l1_latency_capacity", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_l1_bound_group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: DEPENDENT_LOADS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;Topdown= L3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tm= a_l2_bound_group", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_issueLat;tma_l3_bound_group;tma_overlap", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to 
lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma= _issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / tma_i= nfo_thread_clks", + "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_gro= up;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group= ;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_op= erations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4= _group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ /= tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (I= DQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_inf= o_thread_clks", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_g= roup;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_port= s_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks", + "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tm= a_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", + "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L= 3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_iss= ueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). 
Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_oth= er_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_AR= ITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * t= ma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI= _256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tm= a_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", + "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tm= a_branch_mispredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", + "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;t= ma_machine_clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", + "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists= _group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_= PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread= _clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUN= D_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_= 3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_b= ound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_= utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL= 1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2= P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@) - tma_core_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": 
"tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bo= und_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueS= pSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_cl= ks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bou= nd_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;T= opdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_= group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * = tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clk= s", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache 
translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", + "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4= _group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", + "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4= _group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_g= roup", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. 
Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + t= ma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bo= und + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + = tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)= ) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_cap= acity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bou= nd * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_= fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_la= tency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bou= nd / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + t= ma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_boun= d * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tm= a_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)= ))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE= / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serial= izing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_m= icrocode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microc= ode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb= _full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_st= ores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
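(Aside: MetricThreshold expressions such as the one in tma_branch_resteers above are chained, so a child node is only flagged when its TMA ancestors are also above their thresholds. A minimal Python sketch of that logic with invented metric values; perf implements this in its own expression evaluator.)

  metrics = {"tma_branch_resteers": 0.08,   # invented values
             "tma_fetch_latency": 0.14,
             "tma_frontend_bound": 0.22}
  # Mirrors "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1
  # & tma_frontend_bound > 0.15" from the entry above.
  flagged = (metrics["tma_branch_resteers"] > 0.05
             and metrics["tma_fetch_latency"] > 0.1
             and metrics["tma_frontend_bound"] > 0.15)
  print(flagged)  # True: worth drilling further down the FetchLat subtree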
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instructions",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instructions. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + 
ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_tk_bwd_mispredicts", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_tk_fwd_mispredicts", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_L= OAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * t= ma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 <= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.= XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_cor= e_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= ) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling 
synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(cpu@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ + = IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_= DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
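(Aside: the Denormals case mentioned in tma_fp_assists above is easy to reproduce. The sketch below only generates subnormal values to show where the slow path begins; actually observing the assists requires sampling the ASSISTS.FP event with perf, which Python cannot do by itself.)

  import sys

  x = sys.float_info.min   # smallest *normal* double, ~2.2e-308
  for _ in range(10):
      x /= 2               # now in subnormal (denormal) range
  # Tiny but non-zero: arithmetic on such values is what may trigger
  # the microcode-assisted slow path counted by ASSISTS.FP.
  print(x, x < sys.float_info.min)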
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_scalar",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_vector",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_128b",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. 
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=0x30@ / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_256b",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_frontend_bound",
+        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fused_instructions",
+        "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_heavy_operations",
+        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_icache_misses",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect CALL instructions",
+        "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_call_mispredicts",
+        "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect JMP instructions",
+        "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_jump_mispredicts",
+        "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
+        "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for conditional backward-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for conditional forward-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_indirect",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmisp_ret",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts",
+        "MetricName": "tma_info_bad_spec_ipmispredict",
+        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
+        "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
+        "MetricGroup": "BrMispredicts",
+        "MetricName": "tma_info_bad_spec_spec_clears_ratio",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are CALL or RET",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_callret",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are non-taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_nt",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are backward taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_tk_bwd",
+        "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are forward taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_tk_fwd",
+        "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_jump",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)",
+        "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + tma_info_branches_jump)",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_other_branches",
"cpu_core" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
+        "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
+        "MetricGroup": "DSBmiss",
+        "MetricName": "tma_info_frontend_dsb_switch_cost",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired DSB misses",
+        "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;Fed;FetchLat",
+        "MetricName": "tma_info_frontend_dsb_switches_ret",
+        "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average number of Uops issued by front-end when it issued something",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
+        "MetricGroup": "Fed;FetchBW",
+        "MetricName": "tma_info_frontend_fetch_upc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average Latency for L1 instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS",
+        "MetricGroup": "Fed;FetchLat;IcMiss",
+        "MetricName": "tma_info_frontend_icache_miss_latency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS",
+        "MetricGroup": "DSBmiss;Fed",
+        "MetricName": "tma_info_frontend_ipdsb_miss_ret",
+        "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY",
+        "MetricGroup": "Fed",
+        "MetricName": "tma_info_frontend_ipunknown_branch",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "L2 cache true code cacheline misses per kilo instruction",
+        "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "IcMiss",
+        "MetricName": "tma_info_frontend_l2mpki_code",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "L2 cache speculative code cacheline misses per kilo instruction",
+        "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "IcMiss",
+        "MetricName": "tma_info_frontend_l2mpki_code_all",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of Uops delivered by the LSD (Loop Stream Detector; aka Loop Cache)",
+        "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY",
+        "MetricGroup": "Fed;LSD",
+        "MetricName": "tma_info_frontend_lsd_coverage",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired operations that invoke the Microcode Sequencer",
+        "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIRED.MS_FLOWS@R / tma_info_thread_clks",
+        "MetricGroup": "Fed;FetchLat;MicroSeq",
+        "MetricName": "tma_info_frontend_ms_latency_ret",
+        "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
"BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). 
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "MetricGroup": "Flops;FpScalar;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "MetricGroup": "Flops;FpScalar;InsType",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Branches;Fed;InsType",
+        "MetricName": "tma_info_inst_mix_ipbranch",
+        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
+        "MetricGroup": "Branches;Fed;PGO",
+        "MetricName": "tma_info_inst_mix_ipcall",
+        "MetricThreshold": "tma_info_inst_mix_ipcall < 200",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
+        "MetricGroup": "Flops;InsType",
+        "MetricName": "tma_info_inst_mix_ipflop",
+        "MetricThreshold": "tma_info_inst_mix_ipflop < 10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS",
+        "MetricGroup": "InsType",
+        "MetricName": "tma_info_inst_mix_ipload",
+        "MetricThreshold": "tma_info_inst_mix_ipload < 3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per PAUSE (lower number means higher occurrence rate)",
+        "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.PAUSE_INST",
+        "MetricGroup": "Flops;FpVector;InsType",
+        "MetricName": "tma_info_inst_mix_ippause",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES",
+        "MetricGroup": "InsType",
+        "MetricName": "tma_info_inst_mix_ipstore",
+        "MetricThreshold": "tma_info_inst_mix_ipstore < 8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
"MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / 
INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": 
"tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb", 
+ "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch (Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "OS",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative to nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Pipeline",
+        "MetricName": "tma_info_thread_clks",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggests high rate of \"execute\" at rename stage",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Ret;Summary",
+        "MetricName": "tma_info_thread_ipc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "TmaL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit Level 1 after missing Level 0 within the L1D cache",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_capacity",
+        "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: DEPENDENT_LOADS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RETIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to load accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to load accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RETIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED.L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+]). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= 
IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / tma_i= nfo_thread_clks", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details", + "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_AR= ITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * t= ma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI= _256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tm= a_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", + "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", + "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. 
For example; when there are too many multiply operations", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. 
Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_cl= ks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * = tma_info_thread_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. 
Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores.", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric serves as an approximation of legacy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. 
See Tip under Tuning Hint", + "ScaleUnit": "100%", + "Unit": "cpu_core" + } +] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json b/tools/p= erf/pmu-events/arch/x86/lunarlake/memory.json index 3d12e226d5ef..60daff922a89 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/memory.json @@ -1,13 +1,168 @@ [ { - "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 1024 cycles.", + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck, on a load that retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a core bound stall includin= g a store address match, a DTLB miss or a page walk that detains the load f= rom retiring.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_BOUND_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xf4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a DL1 miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DL1 = miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to other block cases.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer is stalled due to other block cases such as = pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to other = block cases.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER_AT_RET", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer and retirement are both stalled due to other= block cases such as pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0xc0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a 
pagewalk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a store address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to store data forward block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_DATA", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to request buffers full or loc= k in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to reques= t buffers full or lock in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to disambiguations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears due to memory orderi= ng conflicts.", "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "PublicDescription": "Counts the number of Machine Clears detected= dye to memory ordering. 
Memory Ordering Machine Clears may apply when a me= mory read may not conform to the memory ordering rules of the x86 architect= ure", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING_FAST", + "SampleAfterValue": "20003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 1024 cycles.", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1", @@ -15,13 +170,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 128 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1", @@ -29,13 +183,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 16 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1", @@ -43,13 +196,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 2048 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", "MSRIndex": "0x3F6", "MSRValue": "0x800", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "23", "UMask": "0x1", @@ -57,13 +209,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 256 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. 
Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1",
@@ -71,13 +222,12 @@
     },
     {
         "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Counter": "2,3,4,5,6,7,8,9",
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -85,13 +235,12 @@
     },
     {
         "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Counter": "2,3,4,5,6,7,8,9",
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
@@ -99,13 +248,12 @@
     },
     {
         "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Counter": "2,3,4,5,6,7,8,9",
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1",
@@ -113,13 +261,12 @@
     },
     {
         "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Counter": "2,3,4,5,6,7,8,9",
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1",
@@ -127,13 +274,12 @@
     },
     {
         "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Counter": "2,3,4,5,6,7,8,9",
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1",
@@ -145,54 +291,112 @@
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE",
-        "PEBS": "2",
         "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8",
         "SampleAfterValue": "1000003",
         "UMask": "0x2",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Counts cacheable demand data reads were not supplied by the L3 cache.",
+        "BriefDescription": "Counts misaligned loads that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts misaligned stores that are 4K page splits.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x13",
+        "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the L3 cache and were supplied by the system memory (DRAM, MSC, or MMIO).",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x13FBFC00004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache and were supplied by the system memory (DRAM, MSC, or MMIO).",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.L3_MISS",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3FBFC00001",
+        "MSRValue": "0x13FBFC00001",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
+        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache and were supplied by the system memory (DRAM, MSC, or MMIO).",
         "Counter": "0,1,2,3",
         "EventCode": "0x2A,0x2B",
         "EventName": "OCR.DEMAND_DATA_RD.L3_MISS",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0xFE7F8000001",
+        "MSRValue": "0x9E7FA000001",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Counts demand reads for ownership, including SWPREFETCHW which is an RFO were not supplied by the L3 cache.",
+        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache and were supplied by the system memory (DRAM, MSC, or MMIO).",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_RFO.L3_MISS",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3FBFC00002",
+        "MSRValue": "0x13FBFC00002",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache.",
+        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache and were supplied by the system memory (DRAM, MSC, or MMIO).",
         "Counter": "0,1,2,3",
         "EventCode": "0x2A,0x2B",
         "EventName": "OCR.DEMAND_RFO.L3_MISS",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0xFE7F8000002",
+        "MSRValue": "0x9E7FA000002",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts demand data read requests that miss the L3 cache.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x21",
+        "EventName": "OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles where data return is pending for a Demand Data Read request who miss L3 cache.",
+        "Counter": "0,1,2,3",
+        "CounterMask": "1",
+        "EventCode": "0x20",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_L3_MISS_DEMAND_DATA_RD",
+        "PublicDescription": "Cycles with at least 1 Demand Data Read requests who miss L3 cache in the superQ.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x20",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD",
+        "PublicDescription": "For every cycle, increments by the number of demand data read requests pending that are known to have missed the L3 cache. Note that this does not capture all elapsed cycles while requests are outstanding - only cycles from when the requests were known by the requesting core to have missed the L3 cache.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    }
 ]
+ "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Ifetch": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Mem_Exec": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from 
Top-down Microarchitecture Analysis Metrics s= preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + "TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "load_store_bound": "Grouping from Top-down Microarchitecture Analysis= Metrics spreadsheet", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwi= dth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency = category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", + "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", + "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", + "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + 
"tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcod= e_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_= ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_util= ization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utili= zed_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_util= ized_3m category", + "tma_resource_bound_group": "Metrics contributing to tma_resource_boun= d category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" +} diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/other.json b/tools/pe= rf/pmu-events/arch/x86/lunarlake/other.json index 0b49b4684c4b..667707d4fe37 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/other.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/other.json @@ -1,6 +1,345 @@ [ { - "BriefDescription": "Counts cacheable demand data reads Catch all = value for any response types - this includes response types not define in t= he OCR. 
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/other.json b/tools/perf/pmu-events/arch/x86/lunarlake/other.json
index 0b49b4684c4b..667707d4fe37 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/other.json
@@ -1,6 +1,345 @@
 [
     {
-        "BriefDescription": "Counts cacheable demand data reads Catch all value for any response types - this includes response types not define in the OCR. If this is set all other response types will be ignored",
+        "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.HARDWARE",
+        "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events. This includes, but not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist ) , assists generated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "ASSISTS.PAGE_FAULT",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.PAGE_FAULT",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa2",
+        "EventName": "BE_STALLS.SCOREBOARD",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted cycles a Core is blocked due to a lock In Progress issued by another core",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.BLOCKED_CYCLES",
+        "PublicDescription": "Counts the number of unhalted cycles a Core is blocked due to a lock In Progress issued by another core. Counts on a per core basis.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted cycles a Core is blocked due to an Accepted lock it issued, includes both split and non-split lock cycles.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.LOCK_CYCLES",
+        "PublicDescription": "Counts the number of unhalted cycles a Core is blocked due to an Accepted lock it issued, includes both split and non-split lock cycles. Counts on a per core basis.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of non-split locks such as UC locks issued by a Core (does not include cache locks)",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.NON_SPLIT_LOCKS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of split locks issued by a Core",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.SPLIT_LOCKS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Count number of times a load is depending on another load that had just write back its data or in previous or 2 cycles back. This event supports in-direct dependency through a single uop.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x02",
+        "EventName": "DEPENDENT_LOADS.ANY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the L2 Prefetchers are at throttle level 0",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x32",
+        "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL0_SOC",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the L2 Prefetcher throttle level is at 1",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x32",
+        "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL1_SOC",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the L2 Prefetcher throttle level is at 2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x32",
+        "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL2_SOC",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the L2 Prefetcher throttle level is at 3",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x32",
+        "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL3_SOC",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the L2 Prefetcher throttle level is at 4",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x32",
+        "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL4_SOC",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on all Integer ports.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.ALL",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xff",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on a load port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.LD",
+        "PublicDescription": "Counts the number of uops executed on a load port. This event counts for integer uops even if the destination is FP/vector",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer port 0.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.P0",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer port 1.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.P1",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer port 2.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.P2",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer port 3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.P3",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x40",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on integer port 0,1, 2, 3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.PRIMARY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x78",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on a Store address port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.STA",
+        "PublicDescription": "Counts the number of uops executed on a Store address port. This event counts integer uops even if the data source is FP/vector",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on an integer store data and jump port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb3",
+        "EventName": "INT_UOPS_EXECUTED.STD_JMP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of LLC prefetches that were throttled due to Dynamic Prefetch Throttling. The throttle requestor/source could be from the uncore/SOC or the Dead Block Predictor. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x29",
+        "EventName": "LLC_PREFETCHES_THROTTLED.DPT",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of LLC prefetches throttled due to Demand Throttle Prefetcher. DTP Global Triggered with no Local Override. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x29",
+        "EventName": "LLC_PREFETCHES_THROTTLED.DTP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of LLC prefetches not throttled by DTP due to local override. These prefetches may still be throttled due to another throttler mechanism. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x29",
+        "EventName": "LLC_PREFETCHES_THROTTLED.DTP_OVERRIDE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of LLC prefetches throttled due to LLC hit rate in . Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x29",
+        "EventName": "LLC_PREFETCHES_THROTTLED.HIT_RATE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of LLC prefetches throttled due to exceeding the XQ threshold set by either XQ_THRESOLD_DTP or LLC_XQ_THRESHOLD. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x29",
+        "EventName": "LLC_PREFETCHES_THROTTLED.XQ_THRESH",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts cycles where no execution is happening due to loads waiting for L1 cache (that is: no execution & load in flight & no load missed L1 cache)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x46",
+        "EventName": "MEMORY_STALLS.L1",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where no execution is happening due to loads waiting for L2 cache (that is: no execution & load in flight & load missed L1 & no load missed L2 cache)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x46",
+        "EventName": "MEMORY_STALLS.L2",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where no execution is happening due to loads waiting for L3 cache (that is: no execution & load in flight & load missed L1 & load missed L2 cache & no load missed L3 Cache)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x46",
+        "EventName": "MEMORY_STALLS.L3",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where no execution is happening due to loads waiting for Memory (that is: no execution & load in flight & a load missed L3 cache)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x46",
+        "EventName": "MEMORY_STALLS.MEM",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts all requests that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.ALL_REQUESTS.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0xFF0000001DFFF",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts writebacks of modified cachelines that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.COREWB_M.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10008",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts writebacks of non-modified cachelines that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.COREWB_NONM.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x11000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1FBC000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that have any type of response.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE",
@@ -22,7 +361,7 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Counts cacheable demand data reads were supplied by DRAM.",
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.DRAM",
@@ -44,7 +383,7 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Counts demand reads for ownership, including SWPREFETCHW which is an RFO Catch all value for any response types - this includes response types not define in the OCR. If this is set all other response types will be ignored",
+        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE",
@@ -64,5 +403,156 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1FBC000002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts full streaming stores (64 bytes, WCiLF) that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x800000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts partial streaming stores (less than 64 bytes, WCiL) that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x400000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts streaming stores that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10800",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts streaming stores that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10800",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY",
+        "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g. branch mispredictions or i-cache misses)",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY_COUNT",
+        "Invert": "1",
+        "PublicDescription": "Counts end of periods where the Reservation Station (RS) was empty. Could be useful to closely sample on front-end latency issues (see the FRONTEND_RETIRED event of designated precise events)",
+        "SampleAfterValue": "100003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles when RS was empty and a resource allocation stall is asserted",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY_RESOURCE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x75",
+        "EventName": "SERIALIZATION.C01_MS_SCB",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number issue slots not consumed due to a color request for an FCW or MXCSR control register when all 4 colors (copies) are already in use",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x75",
+        "EventName": "SERIALIZATION.COLOR_STALLS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles the uncore cannot take further requests",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0x2d",
+        "EventName": "XQ.FULL",
+        "PublicDescription": "number of cycles when the thread is active and the uncore cannot take any further requests (for example prefetches, loads or stores initiated by the Core that miss the L2 cache).",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand request.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.ALL",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand code read.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.CRDS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand read.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.DRDS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand RFO.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.RFOS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    }
 ]
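Aside for reviewers, not part of the patch: each event added above is an ordinary event-select/umask pair, so it can already be counted as a raw event on a perf binary that predates this update. A small Python sketch that prints the equivalent command line for MEMORY_STALLS.L1 (EventCode 0x46, UMask 0x1 on the cpu_core PMU, taken from the hunk above):

  # MEMORY_STALLS.L1 per the hunk above: EventCode 0x46, UMask 0x1 (cpu_core).
  event, umask = 0x46, 0x1
  # perf accepts the pair directly through the PMU's sysfs format strings.
  print(f"perf stat -e cpu_core/event={event:#x},umask={umask:#x}/ -a -- sleep 1")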
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json b/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
index 220c2115fec9..59e2ae652b2d 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
@@ -1,325 +1,2002 @@
 [
+    {
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles when divide unit is busy executing divide or square root operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0xb0",
+        "EventName": "ARITH.DIV_ACTIVE",
+        "PublicDescription": "Counts cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x9",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of active floating point and integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point and integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xc",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles when any of the integer dividers are active.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles when integer divide unit is busy executing divide or square root operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0xb0",
+        "EventName": "ARITH.IDIV_ACTIVE",
+        "PublicDescription": "Counts cycles when divide unit is busy executing divide or square root operations. Accounts for integer operations only.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of active integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.ANY",
+        "PublicDescription": "Counts the number of occurrences where a microcode assist is invoked by hardware. Examples include AD (page Access Dirty), FP and AVX related assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1f",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Counts the total number of branch instructions retired for all branch types.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of instructions in which the instruction pointer (IP) of the processor is resteered due to a branch instruction and the branch instruction successfully retires. All branch type instructions are accounted for.",
         "SampleAfterValue": "200003",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "All branch instructions retired.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
-        "EventCode": "0xc4",
-        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
-        "PublicDescription": "Counts all branch instructions retired.",
-        "SampleAfterValue": "400009",
-        "Unit": "cpu_core"
+        "BriefDescription": "All branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts all branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts retired JCC (Jump on Conditional Code) branch instructions retired includes both taken and not taken branches",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "PublicDescription": "Counts conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x111",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of not taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Not taken branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+        "PublicDescription": "Counts not taken branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Taken conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "PublicDescription": "Counts taken conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x101",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken backward conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN_BWD",
+        "PublicDescription": "Counts taken backward conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken forward conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN_FWD",
+        "PublicDescription": "Counts taken forward conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x102",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of far branch instructions retired, includes far jump, far call and return, and Interrupt call and return",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.FAR_BRANCH",
+        "SampleAfterValue": "200003",
+        "UMask": "0xbf",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Far branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.FAR_BRANCH",
+        "PublicDescription": "Counts far branch instructions retired.",
+        "SampleAfterValue": "100007",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of near indirect JMP and near indirect CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.INDIRECT",
+        "SampleAfterValue": "200003",
+        "UMask": "0xeb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Indirect near branch instructions retired (excluding returns)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.INDIRECT",
+        "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x80",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of near indirect CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.INDIRECT_CALL",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of near indirect JMP branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.INDIRECT_JMP",
+        "SampleAfterValue": "200003",
+        "UMask": "0xef",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of near CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_CALL",
+        "SampleAfterValue": "200003",
+        "UMask": "0xf9",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Direct and indirect near call instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_CALL",
+        "PublicDescription": "Counts both direct and indirect near call instructions retired.",
+        "SampleAfterValue": "100007",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of near RET branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_RETURN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xf7",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Return instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_RETURN",
+        "PublicDescription": "Counts return instructions retired.",
+        "SampleAfterValue": "100007",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of taken branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xc0",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Taken branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
+        "PublicDescription": "Counts taken branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of near relative CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.REL_CALL",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the total number of mispredicted branch instructions retired for all branch types.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts the total number of mispredicted branch instructions retired. All branch type instructions are accounted for. Prediction of the branch target address enables the processor to begin executing instructions before the non-speculative execution path is known. The branch prediction unit (BPU) predicts the target address based on the instruction pointer (IP) of the branch and on the execution path through which execution reached this IP. A branch misprediction occurs when the prediction is wrong, and results in discarding all instructions executed in the speculative path and re-fetching from the correct path.",
+        "SampleAfterValue": "200003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "All mispredicted branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
+        "SampleAfterValue": "400009",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "All mispredicted branch instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x44",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Mispredicted conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND",
+        "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x111",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted conditional branch instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x151",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted not taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Mispredicted non-taken conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
+        "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted non-taken conditional branch instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x50",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "number of branch instructions retired that were mispredicted and taken.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN",
+        "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x101",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "number of branch instructions retired that were mispredicted and taken backward.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD",
+        "PublicDescription": "Counts taken backward conditional mispredicted branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "number of branch instructions retired that were mispredicted and taken backward. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x8001",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted taken conditional branch instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x141",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "number of branch instructions retired that were mispredicted and taken forward.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD",
+        "PublicDescription": "Counts taken forward conditional mispredicted branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "number of branch instructions retired that were mispredicted and taken forward. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x8002",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted near indirect JMP and near indirect CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT",
+        "SampleAfterValue": "200003",
+        "UMask": "0xeb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Miss-predicted near indirect branch instructions retired (excluding returns)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT",
+        "PublicDescription": "Counts miss-predicted near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x80",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted near indirect CALL branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Mispredicted indirect CALL retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
+        "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted indirect CALL retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x42",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted near indirect branch instructions retired (excluding returns). This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_COST",
+        "SampleAfterValue": "100003",
+        "UMask": "0xc0",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted near indirect JMP branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_JMP",
+        "SampleAfterValue": "200003",
+        "UMask": "0xef",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted near taken branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0x80",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of near branch instructions retired that were mispredicted and taken.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
+        "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Mispredicted taken near branch instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST",
+        "SampleAfterValue": "400009",
+        "UMask": "0x60",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event counts the number of mispredicted ret instructions retired. Non PEBS",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.RET",
+        "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
+        "SampleAfterValue": "100007",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of mispredicted near RET branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.RETURN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xf7",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Mispredicted ret instructions retired. This precise event may be used to get the misprediction cost via the Retire_Latency field of PEBS. It fires on the instruction that immediately follows the mispredicted branch.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.RET_COST",
+        "SampleAfterValue": "100007",
+        "UMask": "0x48",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the total number of BTCLEARS.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xe8",
+        "EventName": "BTCLEAR.ANY",
+        "PublicDescription": "Counts the total number of BTCLEARS which occurs when the Branch Target Buffer (BTB) predicts a taken branch.",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C01",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C02",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.1 or C0.2 or running a PAUSE in C0 ACPI state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C0_WAIT",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.1 or C0.2 power saving optimized states (TPAUSE or UMWAIT instructions) or running the PAUSE instruction.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x70",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.CORE",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Core cycles when the core is not in a halt state.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.CORE",
+        "PublicDescription": "Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the programmable counters available for other events.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.CORE_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Thread cycles when thread is not in halt state [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.CORE_P",
+        "PublicDescription": "This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time. [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Core clocks when a PAUSE is pending.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.PAUSE",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of Pause instructions",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.PAUSE_INST",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted reference clock cycles.",
+        "Counter": "Fixed counter 2",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Reference cycles when the core is not in halt state.",
+        "Counter": "Fixed counter 2",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+        "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted reference clock cycles",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC_P",
+        "PublicDescription": "Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is not affected by core frequency changes and increments at a fixed frequency that is also used for the Time Stamp Counter (TSC). This event uses a programmable general purpose performance counter.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Reference cycles when the core is not in halt state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC_P",
+        "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.THREAD",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Core cycles when the thread is not in a halt state.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.THREAD",
+        "PublicDescription": "Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state.
It is counted on a dedicated fixed counter, leavi= ng the programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles while memory subsystem has an outstand= ing load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "16", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.CYCLES_MEM_ANY", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total execution stalls.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "4", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.STALLS_TOTAL", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.1_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 1 uop = was executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 or 3 uops are executed on a= ll ports and Reservation Station (RS) was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_3_PORTS_UTIL", + "SampleAfterValue": "2000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 2 uops= were executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 3 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.3_PORTS_UTIL", + "PublicDescription": "Cycles total of 3 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 4 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + 
"EventName": "EXE_ACTIVITY.4_PORTS_UTIL", + "PublicDescription": "Cycles total of 4 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Execution stalls while memory subsystem has a= n outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "5", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_LOADS", + "SampleAfterValue": "2000003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where the Store Buffer was full and no= loads caused an execution stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "2", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_STORES", + "PublicDescription": "Counts cycles where the Store Buffer was ful= l and no loads caused an execution stall.", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles no uop executed while RS was not empty= , the SB was not full and there was no outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.EXE_BOUND_0_PORTS", + "PublicDescription": "Number of cycles total of 0 uops executed on= all ports, Reservation Station (RS) was not empty, the Store Buffer (SB) w= as not full and there was no outstanding load.", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction decoders utilized in a cycle", + "Counter": "2", + "EventCode": "0x75", + "EventName": "INST_DECODED.DECODERS", + "PublicDescription": "Number of decoders utilized in a cycle when = the MITE (legacy decode pipeline) fetches instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired.", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. General Count= er - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "retired macro-fused uops when there is a bran= ch in the macro-fused pair (the two instructions that got macro-fused count= once in this pmon)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.BR_FUSED", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INST_RETIRED.MACRO_FUSED", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.MACRO_FUSED", + "SampleAfterValue": "2000003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired NOP instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.NOP", + "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREF= ETCHIT0/1 instructions", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.PREC_DIST", + "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Iterations of Repeat string retired instructi= ons.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. 
Note the number of ite= rations is implementation-dependent.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BPClear.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.BPCLEAR_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0xB", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Clears speculative count", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due = to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles after recovery from a branch mi= sprediction or machine clear till the first uop is issued from the resteere= d path.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEAR_RESTEER_CYCLES", + "PublicDescription": "Cycles after recovery from a branch mispredi= ction or machine clear till the first uop is issued from the resteered path= .", + "SampleAfterValue": "500009", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core cycles the allocator was stalled due to = recovery from earlier clear event for this thread", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.RECOVERY_CYCLES", + "PublicDescription": "Counts core cycles when the Resource allocat= or was stalled due to recovery from an earlier branch misprediction or mach= ine clear event.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BAClear (Unknown Branch).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UNKNOWN_BRANCH_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0x7", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots where uops got dropped", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UOP_DROPPING", + "PublicDescription": "Estimated number of Top-down Microarchitectu= re Analysis slots that got dropped due to non front-end reasons", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 128-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.128BIT", + "SampleAfterValue": "1000003", + "UMask": "0x13", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 256-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.256BIT", + "SampleAfterValue": "1000003", + "UMask": "0xac", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 128-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.ADD_128", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 128-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 256-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": 
"INT_VEC_RETIRED.ADD_256", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 256-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.MUL_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.MUL_256", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.SHUFFLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.SHUFFLES", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_128", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_128", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_256", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "False dependencies in MOB due to partial comp= are on address.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "PublicDescription": "Counts the number of times a load got blocke= d due to false dependencies in MOB due to partial compare on address.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad was blocked for any of the following reasons: utlb_miss, 4k_alias, unkn= own_sta/bad_fwd, unready_fwd (includes md blocks and esp consuming load blo= cks)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address exactly matches an older store whose da= ta is not ready (a.k.a. unknown). 
unready_fwd", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The number of times that split load operation= s are temporarily blocked because all resources for handling the split acce= sses are in use.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.NO_SR", + "PublicDescription": "Counts the number of times that split load o= perations are temporarily blocked because all resources for handling the sp= lit accesses are in use.", + "SampleAfterValue": "100003", + "UMask": "0x88", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. Note: See the table of not supported store forwa= rds in the Optimization Guide.", + "SampleAfterValue": "100003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of demand loads that match = on a wcb (request buffer) allocated by an L1 hardware prefetch", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x4c", + "EventName": "LOAD_HIT_PREFETCH.HW_PF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Uops delivered by the LSD, but didn't = come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_ACTIVE", + "PublicDescription": "Counts the cycles when at least one uop is d= elivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles optimal number of Uops delivered by th= e LSD, but did not come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_OK", + "PublicDescription": "Counts the cycles when optimal number of uop= s is delivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops delivered by the LSD.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa8", + "EventName": "LSD.UOPS", + "PublicDescription": "Counts the number of uops delivered to the b= ack-end by the LSD(Loop Stream Detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts all machine clears for any reason incl= uding, but not limited to memory ordering, SMC, and FP assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY", + "SampleAfterValue": "20003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h 
the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY_FAST", + "SampleAfterValue": "20003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears (nukes) of any type.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.COUNT", + "PublicDescription": "Counts the number of machine clears (nukes) = of any type.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to an internal load passing an older store within the = same CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION_FAST", + "SampleAfterValue": "20003", + "UMask": "0x88", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of nukes due to memory rena= ming", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE", + "SampleAfterValue": "20003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE_FAST", + "SampleAfterValue": "20003", + "UMask": "0x90", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times that the machine c= lears due to a page fault. Covers both I-Side and D-Side (Loads/Stores) pa= ge faults. A page fault occurs when either the page is not present, or an = access violation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.PAGE_FAULT", + "SampleAfterValue": "20003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine with the use of microcode due to SMC= , MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_= TRAP.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SLOW", + "SampleAfterValue": "20003", + "UMask": "0x6e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Self-modifying code (SMC) detected.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SMC", + "PublicDescription": "Counts self-modifying code (SMC) detected, w= hich causes a machine clear.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "LFENCE instructions retired", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe0", + "EventName": "MISC2_RETIRED.LFENCE", + "PublicDescription": "number of LFENCE retired instructions", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of LBR entries recorded. 
Re= quires LBRs to be enabled in IA32_LBR_CTL.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "LBR record is inserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of CLFLUSH, CLWB, and CLDEM= OTE instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.CL_INST", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LFENCE instructions reti= red.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.LFENCE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of RDPMC, RDTSC, and RDTSCP= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.RDPMC_RDTSC_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Count the number of WRMSR instructions retire= d.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.WRMSR", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of faults and software inte= rrupts with vector < 32.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.FAULT_ALL", + "PublicDescription": "Counts the number of faults and software int= errupts with vector < 32, including VOE cases.", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PSB+ nuke events and ToP= A trap events.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.INTEL_PT_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of accesses to KeyLocker ca= che.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.KEYLOCKER_ACCESS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of misses to KeyLocker cach= e.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.KEYLOCKER_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of user interrupts delivere= d.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.ULI_DELIVERY", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of SENDUIPI instructions re= tired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.ULI_SENDUIPI", + "SampleAfterValue": "1000003", + "UMask": "0x9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of VM exits.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.VM_EXIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the 
= front-end from issuing from the UROM until a specified older uop retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.NON_C01_MS_SCB", + "PublicDescription": "Counts the number of issue slots not consume= d by the backend due to a micro-sequencer (MS) scoreboard, which stalls the= front-end from issuing from the UROM until a specified older uop retires. = The most commonly executed instruction with an MS scoreboard is PAUSE.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were not consumed by the back-end pipeline due to lack of ba= ck-end resources, as a result of memory subsystem delays, execution units l= imitations, or other conditions. Software can use this event as the numerat= or for the Backend Bound metric (or top-level category) of the Top-down Mic= roarchitecture Analysis method.", + "SampleAfterValue": "10000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= s.", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BAD_SPEC_SLOTS", + "PublicDescription": "Number of slots of TMA method that were wast= ed due to incorrect speculation. It covers all types of control-flow or dat= a-related mis-speculations.", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= by branch mispredictions", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS", + "PublicDescription": "Number of TMA slots that were wasted due to = incorrect speculation by (any type of) branch mispredictions. This event es= timates number of speculative operations that were issued but not retired a= s well as the out-of-order engine recovery past a branch misprediction.", + "SampleAfterValue": "10000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TOPDOWN.MEMORY_BOUND_SLOTS", + "Counter": "3", + "EventCode": "0xa4", + "EventName": "TOPDOWN.MEMORY_BOUND_SLOTS", + "SampleAfterValue": "10000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. Fixed counter - architectural event", + "Counter": "Fixed counter 3", + "EventName": "TOPDOWN.SLOTS", + "PublicDescription": "Number of available slots for an unhalted lo= gical processor. The event increments by machine-width of the narrowest pip= eline as employed by the Top-down Microarchitecture Analysis method (TMA). = Software can use this event as the denominator for the top-level metrics of= the TMA method. This architectural event is counted on a designated fixed = counter (Fixed Counter 3).", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. 
General counter - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.SLOTS_P", + "PublicDescription": "Counts the number of available slots for an = unhalted logical processor. The event increments by machine-width of the na= rrowest pipeline as employed by the Top-down Microarchitecture Analysis met= hod.", + "SampleAfterValue": "10000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of issue slo= ts not consumed by the backend because allocation is stalled due to a mispr= edicted jump or a machine clear.", + "Counter": "36", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL", + "PublicDescription": "Fixed Counter: Counts the number of issue sl= ots that were not consumed by the backend because allocation is stalled due= to a mispredicted jump or a machine clear. Counts all issue slots blocked= during this recovery window including relevant microcode flows and while u= ops are not yet available in the IQ. Also, includes the issue slots that we= re consumed by the backend but were thrown away because they were younger t= han the mispredict or machine clear.", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Fast Nukes such as Memory Ord= ering Machine clears and MRN nukes", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.FASTNUKE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": 
"0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to memory reservation stall (sche= duler not being able to accept another uop). This could be caused by RSV f= ull or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to IEC and FPC RAT stalls - which= can be due to the FIQ and IEC reservation station stall (integer, FP and S= IMD scheduler not being able to accept another uop. )", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to mrbl stall. 
A 'marble' refers= to a physical register file entry, also known as the physical destination = (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to ROB full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REORDER_BUFFER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls.", + "Counter": "37", + "EventName": "TOPDOWN_FE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x9c", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, 
and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", + "Counter": "38", + "EventName": "TOPDOWN_RETIRING.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x72", + "EventName": "TOPDOWN_RETIRING.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "BriefDescription": "Counts the number of consumed retirement slot= s.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xc5", - "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", - "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", - "SampleAfterValue": "200003", + "EventCode": "0xc2", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", "Unit": "cpu_atom" }, { - "BriefDescription": "All mispredicted branch instructions retired.= ", + "BriefDescription": "Number of non dec-by-all uops decoded by deco= der", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xc5", - "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", - "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. 
When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", - "SampleAfterValue": "400009", + "EventCode": "0x76", + "EventName": "UOPS_DECODED.DEC0_UOPS", + "PublicDescription": "This event counts the number of not dec-by-a= ll uops decoded by decoder 0.", + "SampleAfterValue": "1000003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.CORE", - "SampleAfterValue": "2000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Core cycles when the core is not in a halt st= ate.", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.CORE", - "PublicDescription": "Counts the number of core cycles while the c= ore is not in a halt state. The core enters the halt state when it is runni= ng the HLT instruction. This event is a component in many key event ratios.= The core frequency may change from time to time due to transitions associa= ted with Enhanced Intel SpeedStep Technology or TM2. For this reason this e= vent may have a changing ratio with regards to time. When the core frequenc= y is constant, this event can approximate elapsed time while the core was n= ot in the halt state. It is counted on a dedicated fixed counter, leaving t= he programmable counters available for other events.", + "BriefDescription": "Uops executed on INT EU ALU ports.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.ALU", + "PublicDescription": "Number of ALU integer uops dispatch to execu= tion.", "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.CORE_P", + "BriefDescription": "Uops executed on any INT EU ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.INT_EU_ALL", + "PublicDescription": "Number of integer uops dispatched to executi= on.", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "BriefDescription": "Number of Uops dispatched/executed by any of = the 3 JEUs (all ups that hold the JEU including macro; micro jumps; fetch-f= rom-eip)", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.CORE_P", - "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. 
[This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.JMP", + "PublicDescription": "Number of jump uops dispatch to execution", "SampleAfterValue": "2000003", + "UMask": "0x40", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = reference clock cycles", - "Counter": "Fixed counter 2", - "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "BriefDescription": "Uops executed on Load ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.LOAD", + "PublicDescription": "Number of Load uops dispatched to execution.= ", "SampleAfterValue": "2000003", - "UMask": "0x3", - "Unit": "cpu_atom" + "UMask": "0x4", + "Unit": "cpu_core" }, { - "BriefDescription": "Reference cycles when the core is not in halt= state.", - "Counter": "Fixed counter 2", - "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", + "BriefDescription": "Number of (shift) 1-cycle Uops dispatched/exe= cuted by any of the Shift Eus", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SHIFT", + "PublicDescription": "Number of SHIFT integer uops dispatch to exe= cution", "SampleAfterValue": "2000003", - "UMask": "0x3", + "UMask": "0x20", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted reference clock= cycles", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", - "PublicDescription": "Counts the number of reference cycles that t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction. This event is not affected by core frequency ch= anges and increments at a fixed frequency that is also used for the Time St= amp Counter (TSC). This event uses a programmable general purpose performan= ce counter.", + "BriefDescription": "Number of Uops dispatched/executed by Slow EU= (e.g. 
3+ cycles LEA, >1 cycles shift, iDIVs, CR; *H operation)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SLOW", + "PublicDescription": "Number of Slow integer uops dispatch to exec= ution.", "SampleAfterValue": "2000003", - "UMask": "0x1", - "Unit": "cpu_atom" + "UMask": "0x8", + "Unit": "cpu_core" }, { - "BriefDescription": "Reference cycles when the core is not in halt= state.", + "BriefDescription": "Number of Uops dispatched on STA ports", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", - "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STA", + "PublicDescription": "Number of STA (Store Address) uops dispatch = to execution", "SampleAfterValue": "2000003", - "UMask": "0x1", + "UMask": "0x80", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.THREAD", + "BriefDescription": "Uops executed on STD ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STD", + "PublicDescription": "Number of STD (Store Data) uops dispatch to = execution", "SampleAfterValue": "2000003", - "UMask": "0x2", - "Unit": "cpu_atom" + "UMask": "0x10", + "Unit": "cpu_core" }, { - "BriefDescription": "Core cycles when the thread is not in a halt = state.", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.THREAD", - "PublicDescription": "Counts the number of core cycles while the t= hread is not in a halt state. The thread enters the halt state when it is r= unning the HLT instruction. This event is a component in many key event rat= ios. The core frequency may change from time to time due to transitions ass= ociated with Enhanced Intel SpeedStep Technology or TM2. For this reason th= is event may have a changing ratio with regards to time. When the core freq= uency is constant, this event can approximate elapsed time while the core w= as not in the halt state. 
It is counted on a dedicated fixed counter, leavi= ng the programmable counters available for other events.", + "BriefDescription": "Cycles where at least 1 uop was executed per-= thread", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_1", + "PublicDescription": "Cycles where at least 1 uop was executed per= -thread.", "SampleAfterValue": "2000003", - "UMask": "0x2", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "BriefDescription": "Cycles where at least 2 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "2", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_2", + "PublicDescription": "Cycles where at least 2 uops were executed p= er-thread.", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.CORE_P]", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.THREAD_P", - "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "BriefDescription": "Cycles where at least 3 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_3", + "PublicDescription": "Cycles where at least 3 uops were executed p= er-thread.", "SampleAfterValue": "2000003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired", - "Counter": "Fixed counter 0", - "EventName": "INST_RETIRED.ANY", - "PEBS": "1", + "BriefDescription": "Cycles where at least 4 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "4", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_4", + "PublicDescription": "Cycles where at least 4 uops were executed p= er-thread.", "SampleAfterValue": "2000003", "UMask": "0x1", - "Unit": "cpu_atom" + "Unit": "cpu_core" }, { - "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", - "Counter": "Fixed counter 0", - "EventName": "INST_RETIRED.ANY", - "PEBS": "1", - "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", + "BriefDescription": "Counts number of cycles no uops were dispatch= ed to be executed on this thread.", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.STALLS", + "Invert": "1", + "PublicDescription": "Counts cycles during which no uops were disp= atched from the Reservation Station (RS) per thread.", "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of instructions retired", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xc0", - "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", + "BriefDescription": "Counts the number of uops to be executed per-= thread each cycle.", + "Counter": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.THREAD", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Number of instructions retired. General Count= er - architectural event", + "BriefDescription": "Counts the number of x87 uops dispatched.", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xc0", - "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", - "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.X87", + "PublicDescription": "Counts the number of x87 uops executed.", "SampleAfterValue": "2000003", + "UMask": "0x10", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "BriefDescription": "When 4-uops are requested and only 2-uops are= delivered, the event counts 2. Uops_issued correlates to the number of RO= B entries. If uop takes 2 ROB slots it counts as 2 uops_issued.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x03", - "EventName": "LD_BLOCKS.STORE_FORWARD", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x2", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "SampleAfterValue": "200003", "Unit": "cpu_atom" }, { - "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "BriefDescription": "Uops that RAT issues to RS", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x03", - "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. Note: See the table of not supported store forwa= rds in the Optimization Guide.", - "SampleAfterValue": "100003", - "UMask": "0x82", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of LBR entries recorded. 
Re= quires LBRs to be enabled in IA32_LBR_CTL.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xe4", - "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", - "SampleAfterValue": "1000003", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops that the Resource = Allocation Table (RAT) issues to the Reservation Station (RS).", + "SampleAfterValue": "2000003", "UMask": "0x1", - "Unit": "cpu_atom" + "Unit": "cpu_core" }, { - "BriefDescription": "LBR record is inserted", + "BriefDescription": "UOPS_ISSUED.CYCLES", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xe4", - "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", - "SampleAfterValue": "1000003", + "CounterMask": "1", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.CYCLES", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa4", - "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", - "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were not consumed by the back-end pipeline due to lack of ba= ck-end resources, as a result of memory subsystem delays, execution units l= imitations, or other conditions. Software can use this event as the numerat= or for the Backend Bound metric (or top-level category) of the Top-down Mic= roarchitecture Analysis method.", - "SampleAfterValue": "10000003", - "UMask": "0x2", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of uops retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.ALL", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" }, { - "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. Fixed counter - architectural event", - "Counter": "Fixed counter 3", - "EventName": "TOPDOWN.SLOTS", - "PublicDescription": "Number of available slots for an unhalted lo= gical processor. The event increments by machine-width of the narrowest pip= eline as employed by the Top-down Microarchitecture Analysis method (TMA). = Software can use this event as the denominator for the top-level metrics of= the TMA method. This architectural event is counted on a designated fixed = counter (Fixed Counter 3).", - "SampleAfterValue": "10000003", - "UMask": "0x4", + "BriefDescription": "Cycles with retired uop(s).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.CYCLES", + "PublicDescription": "Counts cycles where at least one uop has ret= ired.", + "SampleAfterValue": "1000003", + "UMask": "0x2", "Unit": "cpu_core" }, { - "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. General counter - architectural event", + "BriefDescription": "Retired uops except the last uop of each inst= ruction.", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa4", - "EventName": "TOPDOWN.SLOTS_P", - "PublicDescription": "Counts the number of available slots for an = unhalted logical processor. 
The event increments by machine-width of the na= rrowest pipeline as employed by the Top-down Microarchitecture Analysis met= hod.", - "SampleAfterValue": "10000003", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.HEAVY", + "PublicDescription": "Counts the number of retired micro-operation= s (uops) except the last uop of each instruction. An instruction that is de= coded into less than two uops does not contribute to the count.", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear. [This event is alias to TOPDOWN_BAD_SPECULATION= .ALL_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x73", - "EventName": "TOPDOWN_BAD_SPECULATION.ALL", - "SampleAfterValue": "1000003", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear. [This event is alias to TOPDOWN_BAD_SPECULATION= .ALL]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x73", - "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", - "SampleAfterValue": "1000003", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "BriefDescription": "Counts the number of integer divide uops reti= red", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa4", - "EventName": "TOPDOWN_BE_BOUND.ALL", - "SampleAfterValue": "1000003", - "UMask": "0x2", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "BriefDescription": "Counts the number of uops retired that were d= elivered by the loop stream detector (LSD).", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa4", - "EventName": "TOPDOWN_BE_BOUND.ALL_P", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls", - "Counter": "37", - "EventName": "TOPDOWN_FE_BOUND.ALL", - "SampleAfterValue": "1000003", - "UMask": "0x6", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.LSD", + "SampleAfterValue": "2000003", + "UMask": "0x4", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). 
This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x9c", - "EventName": "TOPDOWN_FE_BOUND.ALL_P", - "SampleAfterValue": "1000003", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", - "Counter": "38", - "EventName": "TOPDOWN_RETIRING.ALL", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_atom" + "BriefDescription": "UOPS_RETIRED.MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of consumed retirement slot= s.", - "Counter": "0,1,2,3,4,5,6,7", + "BriefDescription": "Number of non-speculative switches to the Mic= rocode Sequencer (MS)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", "EventCode": "0xc2", - "EventName": "TOPDOWN_RETIRING.ALL_P", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" + "EventName": "UOPS_RETIRED.MS_SWITCHES", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "PublicDescription": "Switches to the Microcode Sequencer", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" }, { "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that are utilized by operations that eventually get retired (commi= tted) by the processor pipeline. Usually, this event positively correlates = with higher performance for example, as measured by the instructions-per-c= ycle metric.", @@ -330,5 +2007,26 @@ "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles without actually retired uops.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.STALLS", + "Invert": "1", + "PublicDescription": "This event counts cycles without actually re= tired uops.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of x87 uops retired, includ= es those in ms flows", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.X87", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json b/= tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json new file mode 100644 index 000000000000..9046b680649d --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json @@ -0,0 +1,18 @@ +[ + { + "BriefDescription": "Read CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x22", + "EventName": "UNC_M_CAS_COUNT_RD", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Write CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x23", + "EventName": "UNC_M_CAS_COUNT_WR", + "PerPkg": "1", + "Unit": "iMC" + } +] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json index 59af79e3466e..defa3a967754 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json @@ -1,4 +1,70 @@ [ + { + "BriefDescription": 
"Counts the number of page walks initiated by = a demand load that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts walks that miss the PDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.PDE_CACHE_MISS", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccounts for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads that miss the DTLB and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "PublicDescription": "Counts loads that miss the DTLB (Data TLB) a= nd hit the STLB (Second level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x320", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccount for 4k page size only. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT_4K", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccount for large page sizes only. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT_LGPG", + "SampleAfterValue": "200003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for a demand load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a demand load.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -19,6 +85,125 @@ "UMask": "0xe", "Unit": "cpu_core" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pa= ges. Includes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M sizes) c= aused by demand data loads. This implies address translations missed in the= DTLB and further levels of TLB. The page walk can end with or without a fa= ult.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. I= ncludes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding f= or Loads (demand or SW prefetch) in PMH every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is o= utstanding from page walk start till PMH becomes idle again (ready to serve= next walk). 
Includes EPT-walk intervals.", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of page walks outstanding for a demand= load in the PMH each cycle.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for a demand load in the PMH (Page Miss Handler) each cycle.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks initiated by = a store that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts walks that miss the PDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.PDE_CACHE_MISS", + "SampleAfterValue": "2000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to stores that did not start a page walk. Accounts= for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.STLB_HIT", + "PublicDescription": "Counts the number of first level TLB misses = but second level hits due to a demand load that did not start a page walk. = Accounts for all page sizes. Will result in a DTLB write from STLB.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Stores that miss the DTLB and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.STLB_HIT", + "PublicDescription": "Counts stores that miss the DTLB (Data TLB) = and hit the STLB (2nd Level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x320", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for a store.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a store.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -39,6 +224,134 @@ "UMask": "0xe", "Unit": "cpu_core" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. 
The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks= that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M sizes) c= aused by demand data stores. This implies address translations missed in th= e DTLB and further levels of TLB. The page walk can end with or without a f= ault.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that = page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding i= n the page miss handler (PMH) for stores every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = in the page miss handler (PMH) for stores every cycle. A PMH page walk is o= utstanding from page walk start till PMH becomes idle again (ready to serve= next walk). 
Includes EPT-walk intervals.", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of page walks outstanding for a store = in the PMH each cycle.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for a store in the PMH (Page Miss Handler) each cycle.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of times there was an ITLB = miss and a new translation was filled into the ITLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x81", + "EventName": "ITLB.FILLS", + "PublicDescription": "Counts the number of times the machine was u= nable to find a translation in the Instruction Translation Lookaside Buffer= (ITLB) and a new translation was filled into the ITLB. The event is specul= ative in nature, but will not count translations (page walks) that are begu= n and not finished, or translations that are finished but not filled into t= he ITLB.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of page walks initiated by = a instruction fetch that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts walks that miss the PDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.PDE_CACHE_MISS", + "SampleAfterValue": "2000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to an instruction fetch that did not start a page = walk. Account for all pages sizes. 
Will result in an ITLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.STLB_HIT", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction fetch requests that miss the ITLB= and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.STLB_HIT", + "PublicDescription": "Counts instruction fetch requests that miss = the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x120", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for code (instruction fetch) request.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a code (instruction fetch) request= .", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -58,5 +371,120 @@ "SampleAfterValue": "100003", "UMask": "0xe", "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includ= es page walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (2M/4M)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M page size= s) caused by a code fetch. This implies it missed in the ITLB (Instruction = TLB) and further levels of TLB. The page walk can end with or without a fau= lt.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes pag= e walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (4K)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K page sizes) = caused by a code fetch. This implies it missed in the ITLB (Instruction TLB= ) and further levels of TLB. 
The page walk can end with or without a fault.= ", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding f= or iside in PMH every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for iside in PMH every cycle. A PMH page walk is outstanding from page wal= k start till PMH becomes idle again (ready to serve next walk). Includes EP= T-walk intervals. Walks could be counted by edge detecting on this event, = but would count restarted suspended walks.", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of page walks outstanding for an outst= anding code request in the PMH each cycle.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for an outstanding code (instruction fetch) request in the PMH (Page Miss H= andler) each cycle.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a load gets = blocked because of a micro TLB miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DTLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a DTLB miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.DTLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DTLB= miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.DTLB_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x90", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PMH walks that hit in th= e L1 or WCBs", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xbc", + "EventName": "PAGE_WALKER_LOADS.DTLB_L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PMH walks that hit in th= e L2", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xbc", + "EventName": "PAGE_WALKER_LOADS.DTLB_L2_HIT", + "PublicDescription": "Counts the number of PMH walks that hit in t= he L2. 
Includes L2 Hit resulting from and L1D eviction of another core in the same module which is longer latency than a typical L2 hit.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Count number of any STLB flush attempts (Entire, PCID, InvPage, CR3 write, etc)",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xbd",
+        "EventName": "TLB_FLUSHES.STLB_ANY",
+        "SampleAfterValue": "20003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 4472efb5bf9f..6a18b70ec19c 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -21,7 +21,7 @@ GenuineIntel-6-3A,v24,ivybridge,core
 GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
 GenuineIntel-6-(57|85),v16,knightslanding,core
-GenuineIntel-6-BD,v1.01,lunarlake,core
+GenuineIntel-6-BD,v1.09,lunarlake,core
 GenuineIntel-6-A[AC],v1.10,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
--
2.47.1.545.g3c1d2e2a6a-goog
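As a side note on using the iMC CAS-count events added in the lunarlake uncore-memory.json above: DRAM bandwidth can be estimated from them, assuming each CAS command moves one 64-byte cache line. The sketch below is illustrative only and not part of the series; it assumes the events are exposed under the usual uncore_imc PMU name and that perf stat's default -x field ordering (value, unit, event name, ...) applies.

    #!/usr/bin/env python3
    # Rough bandwidth estimate from the CAS counters added above.
    import subprocess

    EVENTS = "uncore_imc/UNC_M_CAS_COUNT_RD/,uncore_imc/UNC_M_CAS_COUNT_WR/"
    SECS = 1.0

    result = subprocess.run(
        ["perf", "stat", "-a", "-x", ",", "-e", EVENTS, "sleep", str(SECS)],
        capture_output=True, text=True,
    )
    for line in result.stderr.splitlines():  # perf stat writes counts to stderr
        fields = line.split(",")
        if any("UNC_M_CAS_COUNT" in f for f in fields):
            try:
                cas_commands = int(fields[0])
            except ValueError:
                continue  # "<not counted>" or "<not supported>"
            # 64 bytes per CAS command (assumption: cache-line transfers)
            print(f"{fields[2]}: {cas_commands * 64 / SECS / 1e9:.2f} GB/s")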
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:53 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-17-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 16/22] perf vendor events: Update Meteorlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update events from v1.10 to v1.11. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v1.11:
https://github.com/intel/perfmon/commit/b9dabd05ff44af24fde0682e16d1a716c932f0d0

This updates the mapfile.csv for the 0xB5 CPUID variant of meteorlake:
https://github.com/intel/perfmon/commit/c3094bc9bbaff30071874a492afc3369554d572e

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 .../pmu-events/arch/x86/meteorlake/cache.json |   20 +
 .../arch/x86/meteorlake/frontend.json         |    9 +
 .../arch/x86/meteorlake/metricgroups.json     |   10 +-
 .../arch/x86/meteorlake/mtl-metrics.json      | 3818 +++++++++++++----
 .../pmu-events/arch/x86/meteorlake/other.json |   22 +
 .../arch/x86/meteorlake/pipeline.json         |   57 +-
 7 files changed, 3162 insertions(+), 776 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 6a18b70ec19c..3cded351f56a 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -22,7 +22,7 @@ GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
 GenuineIntel-6-(57|85),v16,knightslanding,core
 GenuineIntel-6-BD,v1.09,lunarlake,core
-GenuineIntel-6-A[AC],v1.10,meteorlake,core
+GenuineIntel-6-(AA|AC|B5),v1.11,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
 GenuineIntel-6-A7,v1.03,rocketlake,core
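For readers unfamiliar with the mapfile format used in the hunk above: the first column is treated as an extended regular expression matched against the running CPU's identifier string (vendor-family-model, model in upper-case hex), which is why widening A[AC] to (AA|AC|B5) picks up the new 0xB5 variant. A minimal sketch of that lookup, assuming Python; the matching helper below is an illustrative stand-in, not perf's actual code:

    import re

    # One row of mapfile.csv: regex, event-file version, table name, PMU type.
    row = "GenuineIntel-6-(AA|AC|B5),v1.11,meteorlake,core"
    pattern, version, table, pmu_type = row.split(",")

    for cpuid in ("GenuineIntel-6-AA", "GenuineIntel-6-B5", "GenuineIntel-6-BD"):
        # re.match mirrors the prefix-style regex match perf performs.
        picked = f"{table} {version}" if re.match(pattern, cpuid) else "no match"
        print(f"{cpuid} -> {picked}")
    # GenuineIntel-6-B5, the new Meteorlake CPUID variant, now selects the
    # meteorlake tables; GenuineIntel-6-BD still selects lunarlake via its own row.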
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
index 908e3c7f6d6e..254f69c5a1e8 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
@@ -1027,6 +1027,26 @@
         "UMask": "0x42",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the second Level TLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x11",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of store uops retired that miss in the second level TLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES",
+        "SampleAfterValue": "200003",
+        "UMask": "0x12",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of stores uops retired same as MEM_UOPS_RETIRED.ALL_STORES",
         "Counter": "0,1,2,3,4,5,6,7",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json b/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
index b6c52f7385fc..ed4392cf6d76 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json
@@ -539,5 +539,14 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x1",
         "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the micro-sequencer is busy.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xe7",
+        "EventName": "MS_DECODED.MS_BUSY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
index b54a5fc0861f..855585fe6fae 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json
@@ -41,6 +41,7 @@
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis 
Metrics spre= adsheet", "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -89,7 +90,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -99,6 +102,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", @@ -121,10 +125,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -138,5 +145,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json b/t= ools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json index 
1a7392f0da86..20228a9f28ab 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json @@ -84,54 +84,219 @@ "MetricName": "smi_num", "ScaleUnit": "1SMI#" }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", - "MetricExpr": "tma_core_bound", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", + "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL_P@ / (6 * cpu_= atom@CPU_CLK_UNHALTED.CORE@)", + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", "MetricGroup": "TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%", "Unit": "cpu_atom" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE= / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serial= izing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_f= ew_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microc= ode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound= * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dra= m_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_fa= lse_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tm= a_L1_group", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -140,563 +305,2402 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (6 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (6 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", - "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. 
Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", - "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots 
that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (6 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipflop", + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + 
ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * cpu_core@BR_MISP_= RETIRED.COND_TAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_mispredicts", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. 
See Info.Ifetch_Bound", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LO= AD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_= info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cp= u_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP= _FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_fre= quency) * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HI= T.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_L= OAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_ato= m@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))= ) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info= _thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipbranch", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", - "MetricName": "tma_info_br_inst_mix_ipcall", + "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL4;tma_L4_g= roup;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", - "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_core_clks / 2", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", - "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired return Branch Mispre= diction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired Branch Misprediction= ", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipmispredict", + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (6 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", - "Unit": "cpu_atom" - }, + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or more = uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions, where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions, where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (6 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + 
"ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", "MetricGroup": "Flops", - "MetricName": "tma_info_core_flopc", + "MetricName": "tma_info_arith_inst_mix_ipflop", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", - "MetricName": "tma_info_core_ipc", + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", "Unit": "cpu_atom" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / cpu_atom@INST_RE= TIRED.ANY@", - "MetricName": "tma_info_core_upi", + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of ifetch miss bound stalls, where the ifetch miss subsequently misses in the L3",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@ / cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@",
-        "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l3miss",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
-        "MetricGroup": "load_store_bound",
-        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L3",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
-        "MetricGroup": "load_store_bound",
-        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_indirect",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses the L3",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
-        "MetricGroup": "load_store_bound",
-        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3miss",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
+        "MetricGroup": "Bad;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmisp_ret",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric",
+        "MetricName": "tma_info_bad_spec_ipmispredict",
+        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
+        "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
+        "MetricGroup": "BrMispredicts;Metric",
+        "MetricName": "tma_info_bad_spec_spec_clears_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
+        "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t_utilization > 0.5 else 0)",
+        "MetricGroup": "Cor;Metric;SMT",
+        "MetricName": "tma_info_botlnk_l0_core_bound_likely",
+        "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_atom@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Ifetch",
+        "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss. See Info.Ifetch_Bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Load_Store_Miss",
+        "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Mem_Exec",
+        "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricName": "tma_info_br_inst_mix_ipbranch",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instruction per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.NEAR_CALL@",
+        "MetricName": "tma_info_br_inst_mix_ipcall",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIRED.FAR_BRANCH@u",
+        "MetricName": "tma_info_br_inst_mix_ipfarbranch",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was not taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETIRED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired conditional Branch Misprediction where the branch was taken",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.COND_TAKEN@",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired indirect call or jump Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.INDIRECT@",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_indirect",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired return Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.RETURN@",
+        "MetricName": "tma_info_br_inst_mix_ipmisp_ret",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per retired Branch Misprediction",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@",
+        "MetricName": "tma_info_br_inst_mix_ipmispredict",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Ratio of all branches which mispredict",
+        "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Ratio between Mispredicted branches and unknown branches",
+        "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@BACLEARS.ANY@",
+        "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_unknown_branch_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are CALL or RET",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;Fraction",
+        "MetricName": "tma_info_branches_callret",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are non-taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO",
+        "MetricName": "tma_info_branches_cond_nt",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are taken conditionals",
"MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", + "MetricGroup": "Count;SMT", + "MetricName": "tma_info_core_core_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Ret;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": 
"Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / cpu_atom@INST_RE= TIRED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", 
+ "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ + = cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@) / cpu_atom@MEM_BOUND_STALLS_IFE= TCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@ / = cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
+        "MetricGroup": "Flops;FpVector;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_iparith_avx128",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
+        "MetricGroup": "Flops;FpVector;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_iparith_avx256",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Branches;Fed;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_ipbranch",
+        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
+        "MetricGroup": "Branches;Fed;Inst_Metric;PGO",
+        "MetricName": "tma_info_inst_mix_ipcall",
+        "MetricThreshold": "tma_info_inst_mix_ipcall < 200",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
+        "MetricGroup": "Flops;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_ipflop",
+        "MetricThreshold": "tma_info_inst_mix_ipflop < 10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS",
+        "MetricGroup": "InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_ipload",
+        "MetricThreshold": "tma_info_inst_mix_ipload < 3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per PAUSE (lower number means higher occurrence rate)",
+        "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.PAUSE_INST",
+        "MetricGroup": "Flops;FpVector;InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_ippause",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES",
+        "MetricGroup": "InsType;Inst_Metric",
+        "MetricName": "tma_info_inst_mix_ipstore",
+        "MetricThreshold": "tma_info_inst_mix_ipstore < 8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
+        "MetricGroup": "Inst_Metric;Prefetches",
+        "MetricName": "tma_info_inst_mix_ipswpf",
+        "MetricThreshold": "tma_info_inst_mix_ipswpf < 100",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per taken branch",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_issueFB",
+        "MetricName": "tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1",
+        "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses in the L2",
+        "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ + cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@) / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L3",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses the L3",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@ / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a pipeline block",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_l1_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_load_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full",
+        "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_atom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_store_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to floating point assists",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assist_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to self-modifying code",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_ipload",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Store",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_STORES@",
+        "MetricName": "tma_info_mem_mix_ipstore",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that perform one or more locks",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_locks_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that are splits",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_splits_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Ratio of mem load uops to all uops",
+        "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@",
+        "MetricName": "tma_info_mem_mix_memload_ratio",
"Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + 
"BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", + "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": 
"Mem;Metric", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Metric;Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)", + "MetricGroup": "Core_Metric;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * 
DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "MetricGroup": "Cor;Metric;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_pipeline_execute", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Metric;Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "Metric;MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", + "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricName": "tma_info_serialization _%_tpause_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait;Metric", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary;System_Metric", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": 
"HPC;Metric;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Metric;Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Flops", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;Inst_Metric;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "Metric;OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D0x1@", + "MetricGroup": "Mem;MemoryBW;SoC;System_Metric", + "MetricName": "tma_info_system_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Clocks;Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC;System_Metric", + "MetricName": "tma_info_system_power", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "Core_Metric;SMT", + "MetricName": "tma_info_system_smt_2t_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "Count;SoC", + "MetricName": "tma_info_system_socket_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Seconds;Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Count;Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Metric;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Metric;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggests high rate o= f \"execute\" at rename stage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Metric;Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricGroup": "Metric;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots_utilization", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Metric;Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Metric", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDO= WN_RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOW= N_RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_idiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are microcode op= s", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_= RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN= _RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_x87_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_ope= rations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). 
Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_g= roup;tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_por= t_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads that miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric ([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_l1_bound_group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric ([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;Topdown= L3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tm= a_l2_bound_group", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to load accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to load accesses to L3 cache or contended with a sibling Core.=  Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_issueLat;tma_l3_bound_group;tma_overlap", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations, instructions that require = no more than one uop (micro-operation)", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations, instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.PORT_2_3_10", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due 
to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma= _issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", + "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_gro= up;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group= ;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_op= erations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4= _group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_g= roup;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of ([SKL+] injected blend uops out of all Uops Issued, the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_port= s_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of ([SKL+] injected blend uops out of all Uops Issued, the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tm= a_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L= 3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_iss= ueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). 
Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_l1_bound", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_oth= er_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_load_bound", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (6 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", - "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_store_bound", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_i= nt_operations + tma_memory_operations + tma_fused_instructions + tma_non_fu= sed_branches))", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", + "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tm= a_branch_mispredicts_group", + "MetricName": "tma_other_mispredicts", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to self-modifying code", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_= RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", + "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;t= ma_machine_clears_group", + "MetricName": "tma_other_nukes", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@= MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", + "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists= _group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", + "MetricGroup": "Compute;Core_Clocks;TopdownL6;tma_L6_group;tma_alu= _op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", + "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_util= ization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", + "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_util= ization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATC= HED.PORT_6. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_b= ound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALL= S.SCOREBOARD, 0) / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_= utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL= 1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { - "BriefDescription": "Instructions per Load", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_ipload", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2= P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Store", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", - "MetricName": "tma_info_mem_mix_ipstore", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_locks_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (6 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_splits_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (6 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of mem load uops to all uops", - "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_mem_mix_memload_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": 
"Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", - "MetricName": "tma_info_serialization _%_tpause_cycles", + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@) - tma_core_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.REF_TSC@ / TSC", - "MetricName": "tma_info_system_cpu_utilization", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "cpu_atom@FP_FLOPS_RETIRED.ALL@ / (duration_time * 1= e9)", - "MetricGroup": "Flops", - "MetricName": "tma_info_system_gflops", - "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. 
Sample with: UOPS_RETIRED.SLOTS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@k / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "Summary", - "MetricName": "tma_info_system_kernel_utilization", + "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Average Frequency Utilization relative nominal frequency", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@CPU_CLK_UNHALTED.REF_TSC@", - "MetricGroup": "Power", - "MetricName": "tma_info_system_turbo_utilization", + "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks + tma_c02_wait", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of all uops which are FPDiv uops", - "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_other_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer).
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of all uops which are IDiv uops", - "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_uop_mix_idiv_uop_ratio", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of all uops which are microcode ops", - "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - loads that cross 64-byte cache line boundary", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - loads that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of all uops which are x87 uops", - "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_uop_mix_x87_uop_ratio", + "BriefDescription": "This metric represents rate of split store accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity.
Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", - "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are a few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.
Sample with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks", + "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually have less impact on out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full).
Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", - "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup":
"TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", - "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that result = in retirement slots", - "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 
0.75", - "MetricgroupNoGroup": "TopdownL1", + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", + "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4= _group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", + "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4= _group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_g= roup", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -708,8 +2712,8 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization", "MetricThreshold": "tma_alu_op_utilization > 0.4", @@ -718,17 +2722,17 @@ }, { "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists", - "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.
Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", - "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thread_slots", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handling SSE to AVX* or AVX* to SSE transition Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", "MetricThreshold": "tma_avx_assists > 0.1", @@ -737,7 +2741,7 @@ }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -753,46 +2757,151 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category.
Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks.
Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation.
Covers Core Bound when High ILP as well as when long-latency execution units are busy", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments).
Related metrics: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks.
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "Unit": "cpu_core" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS.
Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers", - "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks + tma_unknown_branches", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings).", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings).", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold":
"tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -801,28 +2910,100 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRED.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instruction fetches, that later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K +
ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by non-taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_mispredicts", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency) * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", "MetricThreshold": "tma_contested_accesses > 0.05 &
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
@@ -833,107 +3014,107 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
-        "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 24 * tma_info_system_core_frequency) * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
-        "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
-        "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info_thread_clks",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
-        "MetricExpr": "(cpu_core@IDQ.DSB_CYCLES_ANY@ - cpu_core@IDQ.DSB_CYCLES_OK@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
-        "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * min(cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * min(cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
-        "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
        "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
-        "MetricExpr": "cpu_core@L1D_PEND_MISS.FB_FULL@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
@@ -944,28 +3125,28 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
-        "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
         "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
@@ -975,137 +3156,164 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists",
-        "MetricExpr": "30 * cpu_core@ASSISTS.FP@ / tma_info_thread_slots",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired",
-        "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
-        "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors",
-        "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
-        "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
-        "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
+        "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
         "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
-        "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_FUSED@ / (tma_retiring * tma_info_thread_slots)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
-        "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .). Sample with: UOPS_RETIRED.HEAVY",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
-        "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks",
+        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect CALL instructions",
+        "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_call_mispredicts",
+        "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect JMP instructions",
+        "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ind_jump_mispredicts",
+        "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIRED.COND_NTAKEN@",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIRED.COND_TAKEN@",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIRED.INDIRECT@",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3",
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIRED.RET@",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500",
@@ -1113,15 +3321,15 @@
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;BadSpec;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmispredict",
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
-        "MetricExpr": "cpu_core@INT_MISC.CLEARS_COUNT@ / (cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ + cpu_core@MACHINE_CLEARS.COUNT@)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
+        "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio",
         "Unit": "cpu_core"
@@ -1136,8 +3344,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
@@ -1145,7 +3353,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -1154,142 +3362,36 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
         "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + 2 = * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRED.NOP@) / tma_i= nfo_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_h= it_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1_hit_= latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_b= ig_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY\\,umask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_loc= k_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma= _store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound= + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_i= nstructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@BR_= INST_RETIRED.NEAR_RETURN@) / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_callret", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are non-taken condi= tionals", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_NTAKEN@ / cpu_core@BR= _INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_nt", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are taken condition= als", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_TAKEN@ / cpu_core@BR_= INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_tk", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_TAKEN@ - cpu_core@BR= _INST_RETIRED.COND_TAKEN@ - 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@) / cpu_= core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_jump", "Unit": "cpu_core" @@ -1303,50 +3405,50 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - 
"MetricExpr": "(cpu_core@CPU_CLK_UNHALTED.DISTRIBUTED@ if #SMT_on = else tma_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_core_core_clk= s", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_core_coreipc", "Unit": "cpu_core" }, { "BriefDescription": "uops Executed per Cycle", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / tma_info_thread_cl= ks", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", "MetricGroup": "Power", "MetricName": "tma_info_core_epc", "Unit": "cpu_core" }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / tma_info_core_core_clks", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", "MetricGroup": "Flops;Ret", "MetricName": "tma_info_core_flopc", "Unit": "cpu_core" }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_frontend_dsb_coverage", "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", @@ -1354,29 +3456,37 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FRONTEND_RETI= RED.ANY_DSB_MISS@", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", "MetricGroup": "DSBmiss;Fed", "MetricName": "tma_info_frontend_ipdsb_miss_ret", "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", @@ -1384,50 +3494,72 @@ }, { "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@BACLEARS.= ANY@", + 
"MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_ipunknown_branch", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", - "MetricExpr": "1e3 * cpu_core@FRONTEND_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.CODE_RD_MISS@ / cpu_core@IN= ST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", "MetricGroup": "Fed;LSD", "MetricName": "tma_info_frontend_lsd_coverage", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch.", - "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", "Unit": "cpu_core" }, { "BriefDescription": "Total number of retired Instructions", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "INST_RETIRED.ANY", "MetricGroup": "Summary;TmaL1;tma_L1_group", "MetricName": "tma_info_inst_mix_instructions", "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", @@ -1435,52 +3567,52 @@ }, { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + cpu_core@FP_ARITH_INST_RETIRED.VECTOR@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_DOUBLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_SINGLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Branches;Fed;InsType", "MetricName": "tma_info_inst_mix_ipbranch", "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", @@ -1488,7 +3620,7 @@ }, { "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_ipcall", "MetricThreshold": "tma_info_inst_mix_ipcall < 200", @@ -1496,7 +3628,7 @@ }, { "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ = + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_= RETIRED.256B_PACKED_SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_ipflop", "MetricThreshold": "tma_info_inst_mix_ipflop < 10", @@ -1504,7 +3636,7 @@ }, { "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_LOADS@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipload", "MetricThreshold": "tma_info_inst_mix_ipload < 3", @@ -1512,14 +3644,14 @@ }, { "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@CPU_CLK_U= NHALTED.PAUSE_INST@", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_ippause", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_STORES@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipstore", "MetricThreshold": "tma_info_inst_mix_ipstore < 8", @@ -1527,7 +3659,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@SW_PREFETCH_A= CCESS.T0\\,umask\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", @@ -1535,10 +3667,10 @@ }, { "BriefDescription": "Instructions per taken branch", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": 
"tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1572,235 +3704,259 @@ }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@= INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_fb_hpki", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.ALL_DEMAND_DATA_RD@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L2_LINES_IN.ALL@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (cpu_core@L2_RQSTS.REFERENCES@ - cpu_core@L2_= RQSTS.MISS@) / cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_HIT@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_load", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "Backend;CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.MISS@ / cpu_core@INST_RETIR= 
ED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem;Offcore", "MetricName": "tma_info_memory_l2mpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_MISS@ / cpu_= core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.RFO_MISS@ / cpu_core@INST_R= ETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheMisses;Offcore", "MetricName": "tma_info_memory_l2mpki_rfo", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@OFFCORE_REQUESTS.ALL_REQUESTS@ / 1e9 = / duration_time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@LONGEST_LAT_CACHE.MISS@ / 1e9 / durat= ion_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L3_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_l3mpki", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD@ / cp= u_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_data_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS.DEMAND_DATA_RD@", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAN= D_DATA_RD@ / cpu_core@OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD@", + "MetricExpr": 
"OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", "MetricGroup": "Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l3_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@MEM_LOAD= _COMPLETED.L1_MISS_ANY@", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency", "Unit": "cpu_core" }, { "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "1e3 * cpu_core@SQ_MISC.BUS_LOCK@ / cpu_core@INST_RE= TIRED.ANY@", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_bus_lock_pki", "Unit": "cpu_core" }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_MISC_RETIRED.UC@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki", "Unit": "cpu_core" }, { "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@L1D_PEND= _MISS.PENDING_CYCLES@", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", "MetricGroup": "Mem;MemoryBW;MemoryBound", "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", "Unit": "cpu_core" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", - "MetricExpr": "1e3 * cpu_core@ITLB_MISSES.WALK_COMPLETED@ / cpu_co= re@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricGroup": "Fed;MemoryTLB", "MetricName": "tma_info_memory_tlb_code_stlb_mpki", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "1e3 * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED@ / c= pu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_mpki", "Unit": "cpu_core" }, { "BriefDescription": "Utilization of the core's Page 
Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu_core@ITLB_MISSES.WALK_PENDING@ + cpu_core@DTLB= _LOAD_MISSES.WALK_PENDING@ + cpu_core@DTLB_STORE_MISSES.WALK_PENDING@) / (4= * tma_info_core_core_clks)", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "1e3 * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED@ / = cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki", "Unit": "cpu_core" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from DSB per c= ycle", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@IDQ.DSB_CYCLES_AN= Y@", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_dsb", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from LSD per c= ycle", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@LSD.CYCLES_ACTIVE@", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_lsd", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from MITE per = cycle", - "MetricExpr": "cpu_core@IDQ.MITE_UOPS@ / cpu_core@IDQ.MITE_CYCLES_= ANY@", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_mite", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per a microcode Assist invocatio= n", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
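Two conventions worth noting in the TLB hunk above: the *_mpki metrics all share the same per-kilo-instruction normalization, and tma_info_memory_tlb_page_walks_utilization divides the summed WALK_PENDING occupancy by 4 * core clocks, which I read as modelling four concurrent hardware page walkers per core (my assumption; the JSON does not say). The normalization itself:

  def mpki(misses, instructions):
      # e.g. tma_info_memory_tlb_store_stlb_mpki =
      #   1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY
      return 1e3 * misses / instructions

  print(mpki(1.2e6, 4.8e9))   # -> 0.25, with invented counts
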
See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1808,7 +3964,7 @@ }, { "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C0_WAIT@ / tma_info_threa= d_clks", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", "MetricGroup": "C0Wait", "MetricName": "tma_info_system_c0_wait", "MetricThreshold": "tma_info_system_c0_wait > 0.05", @@ -1816,7 +3972,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency", "Unit": "cpu_core" @@ -1830,22 +3986,22 @@ }, { "BriefDescription": "Average number of utilized CPUs", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "Summary", "MetricName": "tma_info_system_cpus_utilized", "Unit": "cpu_core" }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_HAC_ARB_TRK_REQUESTS.ALL + UNC_HAC_ARB_CO= H_TRK_REQUESTS.ALL) / 1e9 / duration_time", + "MetricExpr": "64 * (UNC_HAC_ARB_TRK_REQUESTS.ALL + UNC_HAC_ARB_CO= H_TRK_REQUESTS.ALL) / 1e9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", @@ -1853,22 +4009,23 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -1876,15 +4033,30 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D1@", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
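The bandwidth metrics in this file (tma_info_system_dram_bw_use above, plus the l1d/l2/l3 *_cache_fill_bw metrics earlier) all share one conversion: an event count in cache lines times 64 bytes, divided by 1e9 and by the new tma_info_system_time (run duration in seconds) that replaces duration_time. A sketch of that conversion, numbers invented:

  def gb_per_sec(cacheline_events, seconds):
      # 64 bytes per cache line; 1e9 bytes per GB
      return 64 * cacheline_events / 1e9 / seconds

  print(gb_per_sec(5.0e8, 2.0))   # -> 16.0 GB/s
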
Accounts for demand loads and L1/L2 prefetches", "Unit": "cpu_core" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power", + "Unit": "cpu_core" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", - "MetricExpr": "(1 - cpu_core@CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE@ /= cpu_core@CPU_CLK_UNHALTED.REF_DISTRIBUTED@ if #SMT_on else 0)", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", "MetricGroup": "SMT", "MetricName": "tma_info_system_smt_2t_utilization", "Unit": "cpu_core" @@ -1896,16 +4068,24 @@ "MetricName": "tma_info_system_socket_clks", "Unit": "cpu_core" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_core" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "tma_info_thread_clks / cpu_core@CPU_CLK_UNHALTED.RE= F_TSC@", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization", "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", "Unit": "cpu_core" @@ -1915,40 +4095,41 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_thread_clks", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", "MetricGroup": "Ret;Summary", "MetricName": "tma_info_thread_ipc", "Unit": "cpu_core" }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization", "Unit": "cpu_core" }, { "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@INS= T_RETIRED.ANY@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", "MetricGroup": "Pipeline;Ret;Retire", "MetricName": "tma_info_thread_uoppi", "MetricThreshold": "tma_info_thread_uoppi > 1.05", @@ -1956,10 +4137,19 @@ }, { "BriefDescription": "Uops per taken branch", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { @@ -1968,114 +4158,124 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. 
Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", - "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_V= EC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", - "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_V= EC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring *= tma_info_thread_slots)", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_por= t_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", - "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
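On the tma_l1_bound expression just above: the subtraction isolates load-bound execution cycles not already attributed to L1D misses, and the max(..., 0) floor guards against the difference going negative, which can happen when the two counters are multiplexed. In Python terms, with invented counts:

  def l1_bound(bound_on_loads, stalls_l1d_miss, clks):
      # max((EXE_ACTIVITY.BOUND_ON_LOADS -
      #      MEMORY_ACTIVITY.STALLS_L1D_MISS) / clks, 0)
      return max((bound_on_loads - stalls_l1d_miss) / clks, 0)

  print(l1_bound(8.0e8, 5.0e8, 4.0e9))   # -> 0.075
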
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", - "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu= _core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 /= 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVIT= Y.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", - "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L3_HIT@R, 9 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations, instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations, instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_2_3_10@ / (3 * tma_in= fo_core_core_clks)", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_load_op_utilization", "MetricThreshold": "tma_load_op_utilization > 0.6", @@ -2088,36 +4288,63 @@ "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / 
(DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@ME= M_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(cpu_core@LSD.CYCLES_ACTIVE@ - cpu_core@LSD.CYCLES_= OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2128,54 +4355,54 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations, uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", @@ -2184,27 +4411,27 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "cpu_core@UOPS_RETIRED.MS@ / tma_info_thread_slots", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", - "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(cpu_core@IDQ.MITE_CYCLES_ANY@ - cpu_core@IDQ.MITE_= CYCLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", @@ -2213,41 +4440,50 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", - "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued, the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued, the Count D= omain; [ADL+] cycles). 
Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu_core@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (cpu_core@BR_INST_RETIRED.AL= L_BRANCHES@ - cpu_core@INST_RETIRED.MACRO_FUSED@) / (tma_retiring * tma_inf= o_thread_slots)", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2263,89 +4499,89 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", - "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", - "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * cpu_core@ASSISTS.PAGE_FAULT@ / tma_info_thread= _slots", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_0@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_1@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_6@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_= ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOU= ND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu= _core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info_thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "max((cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + max= (cpu_core@RS.EMPTY_RESOURCE@ - cpu_core@RESOURCE_STALLS.SCOREBOARD@, 0)) / = tma_info_thread_clks, 1) * (cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_cor= e@EXE_ACTIVITY.BOUND_ON_LOADS@) / tma_info_thread_clks", + "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALL= S.SCOREBOARD, 0) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2353,28 +4589,37 @@ { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2385,98 +4630,98 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", - "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * min(cpu_co= re@MEM_INST_RETIRED.SPLIT_LOADS@R, tma_info_memory_load_miss_real_latency) = / tma_info_thread_clks", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * min(cpu_c= ore@MEM_INST_RETIRED.SPLIT_STORES@R, 1) / tma_info_thread_clks", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", - "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.= L2_STALLS@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", - "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", - "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", - "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_4_9@ + cpu_core@UOPS= _DISPATCHED.PORT_7_8@) / (4 * tma_info_core_core_clks)", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", "MetricThreshold": "tma_store_op_utilization > 0.6", @@ -2489,46 +4734,73 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", - "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= core_core_clks", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= 
OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", - "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_i= nfo_thread_clks", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info= _thread_clks", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", - "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_co= re@UOPS_EXECUTED.THREAD@", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%", "Unit": "cpu_core" } diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/other.json b/tools/p= erf/pmu-events/arch/x86/meteorlake/other.json index 53d23d8decc6..a30172503354 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/other.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/other.json @@ -95,6 +95,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores which modify only par= t of a 64 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json b/tool= s/perf/pmu-events/arch/x86/meteorlake/pipeline.json index bc806c7330f4..4cfa680f9c96 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "Counts the number of cycles when any of the d= ividers are active.", + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", "Counter": "0,1,2,3,4,5,6,7", "CounterMask": "1", "EventCode": "0xcd", @@ -159,6 +159,15 @@ "UMask": "0xfb", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of near indirect JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_atom" + }, { "BriefDescription": "This event is deprecated. 
Refer to new event = BR_INST_RETIRED.INDIRECT_CALL", "Counter": "0,1,2,3,4,5,6,7", @@ -209,6 +218,15 @@ "UMask": "0x8", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of near taken branch instru= ctions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xc0", + "Unit": "cpu_atom" + }, { "BriefDescription": "Taken branch instructions retired.", "Counter": "0,1,2,3,4,5,6,7", @@ -220,6 +238,24 @@ "UMask": "0x20", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near relative JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_JMP", + "SampleAfterValue": "200003", + "UMask": "0xdf", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", "Counter": "0,1,2,3,4,5,6,7", @@ -390,6 +426,15 @@ "UMask": "0xc0", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of mispredicted near taken = branch instructions retired.", "Counter": "0,1,2,3,4,5,6,7", @@ -1176,6 +1221,16 @@ "UMask": "0x20", "Unit": "cpu_core" }, + { + "BriefDescription": "Cycles stalled due to no store buffers availa= ble. (not including draining form sync).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa2", + "EventName": "RESOURCE_STALLS.SB", + "PublicDescription": "Counts allocation stall cycles caused by the= store buffer (SB) being full. 
This counts cycles that the pipeline back-end blocked uop delivery from the front-end.", "SampleAfterValue": "100003", "UMask": "0x8", "Unit": "cpu_core" }, { "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.", "Counter": "0,1,2,3,4,5,6,7", -- 2.47.1.545.g3c1d2e2a6a-goog
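A note on the guarded-minimum pattern that the updated tma_split_loads and tma_split_stores expressions above introduce: the new MetricExpr multiplies the event count by its retirement-latency term (the cpu_core@...@R suffix) only when that term resolves to a positive value, and otherwise falls back to an estimated latency, so an unavailable (zero) retirement latency no longer zeroes out the metric as the old min() form could. The Python sketch below mirrors that evaluation logic for tma_split_loads; it is an illustration, not part of the patch, and the function name and counter values are hypothetical.

def tma_split_loads(split_loads, split_loads_r,
                    load_miss_real_latency, thread_clks):
    """Mirror of the updated tma_split_loads MetricExpr."""
    if 0 < split_loads_r:
        # Measured retirement latency available: charge the cheaper of
        # the measured-latency and estimated-latency costs.
        cost = min(split_loads * split_loads_r,
                   split_loads * load_miss_real_latency)
    else:
        # No measured retirement latency: fall back to the estimate.
        cost = split_loads * load_miss_real_latency
    return cost / thread_clks

# Hypothetical numbers: 1e6 split loads, 7-cycle retirement latency,
# 12-cycle estimated load-miss latency, 5e8 unhalted thread clocks.
print(f"{tma_split_loads(1e6, 7.0, 12.0, 5e8):.2%}")  # prints "1.40%"

Either branch divides by the thread clocks, so the result stays a fraction of cycles, consistent with the metric's "ScaleUnit": "100%". The tma_split_stores expression follows the same pattern with an estimated cost of 1 cycle per split store.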
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:54 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-18-irogers@google.com>
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 17/22] perf vendor events: Update Rocketlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.03 to v1.04. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v1.04:
https://github.com/intel/perfmon/commit/015d5a5eab6850e6367ee4f82e4808e166eaf5a5

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 .../pmu-events/arch/x86/rocketlake/cache.json |  34 +-
 .../arch/x86/rocketlake/frontend.json         |  17 -
 .../arch/x86/rocketlake/memory.json           |  13 +-
 .../arch/x86/rocketlake/metricgroups.json     |  10 +-
 .../arch/x86/rocketlake/pipeline.json         |  30 +-
 .../arch/x86/rocketlake/rkl-metrics.json      | 849 +++++++++---------
 .../x86/rocketlake/uncore-interconnect.json   |  10 -
 .../arch/x86/rocketlake/uncore-other.json     |   2 +-
 .../arch/x86/rocketlake/virtual-memory.json   |  18 +
 10 files changed, 500 insertions(+), 485 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 3cded351f56a..9cdd8818c087 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -25,7 +25,7 @@ GenuineIntel-6-BD,v1.09,lunarlake,core
 GenuineIntel-6-(AA|AC|B5),v1.11,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
-GenuineIntel-6-A7,v1.03,rocketlake,core
+GenuineIntel-6-A7,v1.04,rocketlake,core
 GenuineIntel-6-2A,v19,sandybridge,core
 GenuineIntel-6-8F,v1.23,sapphirerapids,core
 GenuineIntel-6-AF,v1.04,sierraforest,core
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json
index 2e93b7835b41..791fa526d192 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json
+++ 
b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json @@ -75,11 +75,11 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0xF2", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache when triggered by an L2 cache fill. These lines are ty= pically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" }, @@ -251,7 +251,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -262,7 +261,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -273,7 +271,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -284,7 +281,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -295,7 +291,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -306,7 +301,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -317,7 +311,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -328,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -339,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -350,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -361,7 +351,6 @@ "Data_LA": "1", 
"EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -372,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -383,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -394,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -405,7 +391,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -416,7 +401,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -427,7 +411,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -438,7 +421,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -449,7 +431,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -460,7 +441,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -910,6 +890,16 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, + { + "BriefDescription": "Cycles with outstanding code read requests pe= nding.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x60", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Cycles with outstanding code read requests p= ending. Code Read requests include both cacheable and non-cacheable Code R= eads. 
Requests are considered outstanding from the time they miss the core= 's L2 cache until the transaction completion message is sent to the request= or.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles where at least 1 outstanding Demand RF= O request is pending.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json b/tool= s/perf/pmu-events/arch/x86/rocketlake/frontend.json index e7c7d4d4152d..7afa2436d584 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json @@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json b/tools/= perf/pmu-events/arch/x86/rocketlake/memory.json index f73035f44330..abaf3f4f9d63 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json @@ -88,7 +88,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -101,7 +100,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -114,7 +112,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -127,7 +124,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +136,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -153,7 +148,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -166,7 +160,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -179,7 +172,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. 
Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -287,17 +279,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json b/= tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json index 3a88260194d1..80ca8021f2de 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json @@ -37,6 +37,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -83,7 +84,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -93,6 +96,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", "tma_issueBW": "Metrics related by the issue $issueBW", @@ -112,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics 
contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -128,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json b/tool= s/perf/pmu-events/arch/x86/rocketlake/pipeline.json index 4fdf07c7beb7..44659f26cbab 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json @@ -9,6 +9,15 @@ "SampleAfterValue": "1000003", "UMask": "0x9" }, + { + "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0x14", + "EventName": "ARITH.FP_DIVIDER_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", "Counter": "0,1,2,3,4,5,6,7", @@ -23,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -32,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -42,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -52,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. 
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "50021", "UMask": "0x11" @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "50021", "UMask": "0x10" @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch in= structions retired (excluding RETs. TSX aborts is considered indirect branc= h).", "SampleAfterValue": "50021", "UMask": "0x80" @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "50021", "UMask": "0x8" @@ -377,7 +369,6 @@ "BriefDescription": "Number of instructions retired. 
Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -387,7 +378,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -396,7 +386,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -404,7 +393,6 @@ "BriefDescription": "Precise instruction retired event with a redu= ced effect of PEBS shadow in IP distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = more unbiased distribution of samples across instructions retired. It utili= zes the Precise Distribution of Instructions Retired (PDIR) feature to miti= gate some bias in how retired instructions get sampled. Use on Fixed Counte= r 0.", "SampleAfterValue": "2000003", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/t= ools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json index 13474af97786..cfda8956353e 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,11 +129,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -147,7 +240,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. 
Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -155,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -164,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -173,18 +266,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= 
_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(29 * tma_info_system_core_frequency * MEM_LOAD_L3_= HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L= 1_MISS / 2) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((32.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (27 * tm= a_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_= LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -194,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. 
Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -221,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -231,8 +372,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -241,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -249,44 +390,44 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -296,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -306,16 +447,16 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. 
This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -324,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -333,7 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -341,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -350,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -359,8 +508,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -368,8 +517,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -377,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -389,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
-        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=0x1@) / IDQ.MITE_UOPS",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
@@ -407,41 +556,41 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -455,32 +604,11 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
     },
-    {
-        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
-        "MetricExpr": "tma_info_botlnk_l0_core_bound_likely",
-        "MetricGroup": "Cor;Metric;SMT",
-        "MetricName": "tma_info_botlnk_core_bound_likely",
-        "MetricThreshold": "tma_info_botlnk_core_bound_likely > 0.5"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd))",
-        "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB",
-        "MetricName": "tma_info_botlnk_dsb_misses",
-        "MetricThreshold": "tma_info_botlnk_dsb_misses > 10"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck.",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
-        "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL",
-        "MetricName": "tma_info_botlnk_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_ic_misses > 5"
-    },
     {
         "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
         "MetricConstraint": "NO_GROUP_EVENTS",
@@ -491,8 +619,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
@@ -500,7 +628,7 @@
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -509,108 +637,10 @@
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "tma_info_bottleneck_big_code",
-        "MetricThreshold": "tma_info_bottleneck_big_code > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
-        "MetricGroup": "BvBO;Ret",
-        "MetricName": "tma_info_bottleneck_branching_overhead",
-        "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5",
-        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
-        "MetricName": "tma_info_bottleneck_cache_memory_bandwidth",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
-        "MetricName": "tma_info_bottleneck_cache_memory_latency",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
-        "MetricGroup": "BvCB;Cor;tma_issueComp",
-        "MetricName": "tma_info_bottleneck_compute_bound_est",
-        "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code",
-        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
-        "MetricName": "tma_info_bottleneck_instruction_fetch_bw",
-        "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
-        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
-        "MetricName": "tma_info_bottleneck_irregular_overhead",
-        "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10",
-        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
-        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_data_tlbs",
-        "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20",
-        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_synchronization"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
-        "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_synchronization",
-        "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 10",
-        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
-        "MetricName": "tma_info_bottleneck_mispredictions",
-        "MetricThreshold": "tma_info_bottleneck_mispredictions > 20",
-        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
-        "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)",
-        "MetricGroup": "BvOB;Cor;Offcore",
-        "MetricName": "tma_info_bottleneck_other_bottlenecks",
-        "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20",
-        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls."
-    },
-    {
-        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.",
-        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "BvUW;Ret",
-        "MetricName": "tma_info_bottleneck_useful_work",
-        "MetricThreshold": "tma_info_bottleneck_useful_work > 20"
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
@@ -671,11 +701,11 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -688,20 +718,20 @@
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
-        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@",
+        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency"
     },
@@ -737,7 +767,13 @@
         "MetricName": "tma_info_frontend_lsd_coverage"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
     },
@@ -755,7 +791,7 @@
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -763,7 +799,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -771,7 +807,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
@@ -779,7 +815,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx512",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -787,7 +823,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
@@ -795,7 +831,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
@@ -840,7 +876,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100"
@@ -850,21 +886,9 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 11",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp"
     },
-    {
-        "BriefDescription": "\"Bus lock\" per kilo instruction",
-        "MetricExpr": "tma_info_memory_mix_bus_lock_pki",
-        "MetricGroup": "Mem;Metric",
-        "MetricName": "tma_info_memory_bus_lock_pki"
-    },
-    {
-        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "tma_info_memory_tlb_code_stlb_mpki",
-        "MetricGroup": "Fed;MemoryTLB;Metric",
-        "MetricName": "tma_info_memory_code_stlb_mpki"
-    },
     {
         "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
         "MetricExpr": "tma_info_memory_l1d_cache_fill_bw",
@@ -889,12 +913,6 @@
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t"
     },
-    {
-        "BriefDescription": "Average Parallel L2 cache miss data reads",
-        "MetricExpr": "tma_info_memory_latency_data_l2_mlp",
-        "MetricGroup": "Memory_BW;Metric;Offcore",
-        "MetricName": "tma_info_memory_data_l2_mlp"
-    },
     {
         "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)",
         "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY",
         "MetricGroup": "CacheHits;Mem",
         "MetricName": "tma_info_memory_fb_hpki"
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l1d_cache_fill_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW",
-        "MetricName": "tma_info_memory_l1d_cache_fill_bw_2t"
-    },
     {
         "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
         "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY",
@@ -927,16 +939,10 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l2_cache_fill_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW",
-        "MetricName": "tma_info_memory_l2_cache_fill_bw_2t"
-    },
     {
         "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)",
         "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY",
@@ -975,28 +981,16 @@
     },
     {
         "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW;Offcore",
         "MetricName": "tma_info_memory_l3_cache_access_bw"
     },
-    {
-        "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l3_cache_access_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore",
-        "MetricName": "tma_info_memory_l3_cache_access_bw_2t"
-    },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
        "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l3_cache_fill_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW",
-        "MetricName": "tma_info_memory_l3_cache_fill_bw_2t"
-    },
     {
         "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads",
         "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY",
@@ -1011,52 +1005,22 @@
     },
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
-        "MetricExpr": "tma_info_memory_load_l2_miss_latency",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
     {
         "BriefDescription": "Average Parallel L2 cache miss demand Loads",
-        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
         "MetricGroup": "Memory_BW;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_mlp"
     },
-    {
-        "BriefDescription": "Average Latency for L3 cache miss demand Loads",
-        "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
-        "MetricName": "tma_info_memory_latency_load_l3_miss_latency"
-    },
-    {
-        "BriefDescription": "Average Latency for L2 cache miss demand Loads",
-        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore",
-        "MetricName": "tma_info_memory_load_l2_miss_latency"
-    },
-    {
-        "BriefDescription": "Average Parallel L2 cache miss demand Loads",
-        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
-        "MetricGroup": "Memory_BW;Metric;Offcore",
-        "MetricName": "tma_info_memory_load_l2_mlp"
-    },
-    {
-        "BriefDescription": "Average Latency for L3 cache miss demand Loads",
-        "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x0@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
-        "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore",
-        "MetricName": "tma_info_memory_load_l3_miss_latency"
-    },
     {
         "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
         "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.FB_HIT)",
         "MetricGroup": "Mem;MemoryBound;MemoryLat",
         "MetricName": "tma_info_memory_load_miss_real_latency"
     },
-    {
-        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "tma_info_memory_tlb_load_stlb_mpki",
-        "MetricGroup": "Mem;MemoryTLB;Metric",
-        "MetricName": "tma_info_memory_load_stlb_mpki"
-    },
     {
         "BriefDescription": "\"Bus lock\" per kilo instruction",
         "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY",
@@ -1065,7 +1029,7 @@
     },
     {
         "BriefDescription": "Un-cacheable retired load per kilo instruction",
-        "MetricExpr": "tma_info_memory_uc_load_pki",
+        "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY",
         "MetricGroup": "Mem",
         "MetricName": "tma_info_memory_mix_uc_load_pki"
     },
@@ -1077,17 +1041,11 @@
         "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)"
     },
     {
-        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
-        "MetricExpr": "tma_info_memory_tlb_page_walks_utilization",
-        "MetricGroup": "Core_Metric;Mem;MemoryTLB",
-        "MetricName": "tma_info_memory_page_walks_utilization",
-        "MetricThreshold": "tma_info_memory_page_walks_utilization > 0.5"
-    },
-    {
-        "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "tma_info_memory_tlb_store_stlb_mpki",
-        "MetricGroup": "Mem;MemoryTLB;Metric",
-        "MetricName": "tma_info_memory_store_stlb_mpki"
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15"
     },
     {
         "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
@@ -1114,15 +1072,9 @@
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_store_stlb_mpki"
     },
-    {
-        "BriefDescription": "Un-cacheable retired load per kilo instruction",
-        "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY",
-        "MetricGroup": "Mem;Metric",
-        "MetricName": "tma_info_memory_uc_load_pki"
-    },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@)",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
     },
@@ -1149,18 +1101,18 @@
         "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
         "MetricGroup": "MicroSeq;Pipeline;Ret;Retire",
         "MetricName": "tma_info_pipeline_ipassist",
-        "MetricThreshold": "tma_info_pipeline_ipassist < 100e3",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
         "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=1@",
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "Pipeline;Ret",
         "MetricName": "tma_info_pipeline_retire"
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -1178,14 +1130,14 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3",
+        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / tma_info_system_time / 1e3",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
-        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full"
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
@@ -1195,13 +1147,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -1224,12 +1177,25 @@
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
+    },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0",
         "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_clks",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license0_utilization",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1",
@@ -1237,7 +1203,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license1_utilization",
         "MetricThreshold": "tma_info_system_power_license1_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)",
@@ -1245,7 +1211,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license2_utilization",
         "MetricThreshold": "tma_info_system_power_license2_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -1259,6 +1225,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_turbo_utilization"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1275,14 +1248,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1292,13 +1266,13 @@
     },
     {
         "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
-        "MetricExpr": "TOPDOWN.SLOTS",
+        "MetricExpr": "slots",
         "MetricGroup": "TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots"
     },
     {
         "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
-        "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
         "MetricGroup": "SMT;TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots_utilization"
     },
@@ -1314,33 +1288,41 @@
         "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 7.5"
+        "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1349,8 +1331,17 @@
         "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1359,17 +1350,17 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_bottleneck_cache_memory_latency, tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1377,18 +1368,18 @@
         "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -1405,7 +1396,7 @@
         "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_hit",
-        "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1413,16 +1404,40 @@
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_miss",
-        "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -1432,7 +1447,7 @@
         "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_lsd",
         "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
         "ScaleUnit": "100%"
     },
     {
@@ -1442,16 +1457,16 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1459,8 +1474,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1470,11 +1485,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1488,7 +1503,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1496,8 +1511,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1511,19 +1526,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1531,8 +1554,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1540,7 +1563,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1555,19 +1578,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1603,7 +1626,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1611,8 +1634,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1620,8 +1643,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1629,7 +1652,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1638,7 +1661,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1647,14 +1670,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1667,7 +1690,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1676,7 +1699,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1685,8 +1708,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1695,17 +1718,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1713,8 +1736,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1723,17 +1746,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1750,7 +1773,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1758,7 +1781,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1766,7 +1813,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1775,7 +1822,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1784,8 +1831,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.= json b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json index 3946d4e01a8c..a0057f8d815e 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json @@ -38,16 +38,6 @@ "UMask": "0x2", "Unit": "ARB" }, - { - "BriefDescription": "Number of all coherent Data Read entries. Doe= sn't include prefetches", - "Counter": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, { "BriefDescription": "Each cycle counts number of all outgoing vali= d entries in ReqTrk. Such entry is defined as valid from its allocation in = ReqTrk until deallocation. Accounts for Coherent and non-coherent traffic.", "Counter": "0", diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json b/= tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json = b/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json index 3ff51040f84f..9df790d4361f 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", "Counter": "0,1,2,3", @@ -82,6 +91,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", --=20 2.47.1.545.g3c1d2e2a6a-goog
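For illustration of the metrics added in this patch: the new TopdownL6 STLB children all share one shape - the parent tma_load_stlb_miss (or tma_store_stlb_miss) is weighted by each page size's share of completed page walks, so the three children sum back to the parent, and each child repeats the parent's full threshold chain. (The flattened MetricThreshold strings are behaviorally unchanged, since a chain of & comparisons evaluates identically with or without the old parentheses.) A minimal Python sketch of that breakdown; the event counts passed in are hypothetical sample values, not measured data:

def stlb_miss_breakdown(stlb_miss, walk_4k, walk_2m_4m, walk_1g):
    # Mirrors the MetricExpr of tma_load_stlb_miss_{4k,2m,1g}: weight the
    # parent metric by each page size's share of completed page walks.
    total = walk_4k + walk_2m_4m + walk_1g
    return {
        "tma_load_stlb_miss_4k": stlb_miss * walk_4k / total,
        "tma_load_stlb_miss_2m": stlb_miss * walk_2m_4m / total,
        "tma_load_stlb_miss_1g": stlb_miss * walk_1g / total,
    }

# Hypothetical counts for DTLB_LOAD_MISSES.WALK_COMPLETED_{4K,2M_4M,1G}:
print(stlb_miss_breakdown(0.12, walk_4k=9000, walk_2m_4m=800, walk_1g=200))

Because the three fractions sum to 1, the children always add up to the parent metric, which is why each is safe to read as a subdivision rather than a new cost.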
From nobody Fri Dec 27 09:48:59 2024 Date: Mon, 9 Dec 2024 14:27:55 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-19-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241209222800.296000-1-irogers@google.com> X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog Subject: [PATCH v1 18/22] perf vendor events: Update Sapphirerapids events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.23 to v1.25. Update TMA metrics from 4.8 to 5.01.
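Two details of this update are worth illustrating. First, the mapfile.csv row bumped below is what binds a CPU to these tables: the first column is matched as a regular expression (entries such as GenuineIntel-6-1[AEF] imply as much) against the vendor-family-model identifier of the running CPU. A minimal sketch of that lookup, assuming Python regex semantics stand in for perf's matcher and using a two-row excerpt of the table:

import re

MAPFILE_ROWS = [  # (cpuid pattern, event version, model, pmu type)
    ("GenuineIntel-6-8F", "v1.25", "sapphirerapids", "core"),
    ("GenuineIntel-6-A7", "v1.04", "rocketlake", "core"),
]

def lookup_model(cpuid):
    for pattern, version, model, pmu_type in MAPFILE_ROWS:
        if re.fullmatch(pattern, cpuid, re.IGNORECASE):
            return model, version
    return None

print(lookup_model("GenuineIntel-6-8F"))  # ('sapphirerapids', 'v1.25')

Second, the FRONTEND_RETIRED.LATENCY_GE_N entries touched in frontend.json differ only in their MSRValue, and by inspection of this file every listed value fits one pattern: 0x600006 with the cycle threshold N placed in bits 19:8 (the BUBBLES variants use a different base and are not covered). A sketch that reproduces the listed values; this bit-field reading is an observation about the table, not a documented claim about MSR 0x3F7:

def latency_ge_msr_value(n):
    # 0x600006 plus the latency threshold shifted into bits 19:8.
    return 0x600006 | (n << 8)

expected = {1: 0x600106, 2: 0x600206, 4: 0x600406, 8: 0x600806,
            16: 0x601006, 32: 0x602006, 64: 0x604006,
            128: 0x608006, 256: 0x610006, 512: 0x620006}
assert all(latency_ge_msr_value(n) == v for n, v in expected.items())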
Bring in the event updates v1.25: https://github.com/intel/perfmon/commit/78d6273c546329052429e3a005491b58fbe= 1167b https://github.com/intel/perfmon/commit/f069ed9d0b69b02d76d4b4c59dfc75b62bf= b2254 The TMA 5.01 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Update uncore IIO events umask with the change: https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667= a2174 which should address an issue originally raised by Michael Petlan: Reported-by: Michael Petlan Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Die= go/ Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sapphirerapids/cache.json | 30 +- .../arch/x86/sapphirerapids/frontend.json | 19 - .../arch/x86/sapphirerapids/memory.json | 15 +- .../arch/x86/sapphirerapids/metricgroups.json | 10 +- .../arch/x86/sapphirerapids/pipeline.json | 23 - .../arch/x86/sapphirerapids/spr-metrics.json | 968 ++++++++++-------- .../arch/x86/sapphirerapids/uncore-io.json | 138 ++- 8 files changed, 631 insertions(+), 574 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 9cdd8818c087..2b0414f4b7aa 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -27,7 +27,7 @@ GenuineIntel-6-1[AEF],v4,nehalemep,core GenuineIntel-6-2E,v4,nehalemex,core GenuineIntel-6-A7,v1.04,rocketlake,core GenuineIntel-6-2A,v19,sandybridge,core -GenuineIntel-6-8F,v1.23,sapphirerapids,core +GenuineIntel-6-8F,v1.25,sapphirerapids,core GenuineIntel-6-AF,v1.04,sierraforest,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json b/too= ls/perf/pmu-events/arch/x86/sapphirerapids/cache.json index eec7bf6ebd53..e35dbb7c2ccd 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json @@ -92,11 +92,11 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0x26", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache when triggered by an L2 cache fill. These lines are ty= pically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" }, @@ -311,7 +311,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. 
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -322,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -333,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -344,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -355,7 +351,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -366,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -377,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -388,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -408,7 +400,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -419,7 +410,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -430,7 +420,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -441,7 +430,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -452,7 +440,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -463,7 +450,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "1000003", 
"UMask": "0x2" }, @@ -473,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -484,7 +469,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -493,7 +477,6 @@ "Counter": "0,1,2,3", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with remote= Intel(R) Optane(TM) DC persistent memory as the data source and the data r= equest missed L3.", "SampleAfterValue": "100007", "UMask": "0x10" @@ -504,7 +487,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -515,7 +497,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -526,7 +507,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -537,7 +517,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -548,7 +527,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -559,7 +537,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -570,7 +547,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -581,7 +557,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -592,7 +567,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with local = Intel(R) Optane(TM) DC persistent memory as the data source and the data re= quest missed L3.", "SampleAfterValue": "1000003", "UMask": "0x80" diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json b/= 
index f6e3e40a3b20..bf68493d4509 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json @@ -41,7 +40,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -53,7 +52,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -65,7 +63,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -77,7 +74,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -89,7 +85,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -101,7 +96,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -113,7 +107,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -125,7 +118,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -137,7 +129,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -149,7 +140,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -161,7 +151,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -173,7 +162,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -185,7 +173,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -197,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -209,7 +195,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -221,7 +206,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -233,7 +217,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -244,7 +227,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -256,7 +238,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json index 2ea19539291b..41d4120d4dae 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json @@ -63,7 +63,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1" @@ -76,7 +75,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -89,7 +87,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -102,7 +99,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -115,7 +111,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +123,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -141,7 +135,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.
Reported latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -154,7 +147,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -167,7 +159,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -178,7 +169,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2" @@ -305,17 +295,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was triggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g.
interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.jso= n b/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json index e1de6c2675c4..9129fb7b7ce4 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json @@ -40,6 +40,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -86,7 +87,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -96,6 +99,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", @@ -116,10 +120,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth = category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", @@ -133,5 +140,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics 
contributing to tma_store_bound category", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category" } diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json index 5d5811f26151..50cacfbbc7cf 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json @@ -62,7 +62,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -71,7 +70,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -81,7 +79,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -91,7 +88,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -101,7 +97,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -111,7 +106,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -121,7 +115,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call instructions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -131,7 +124,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -141,7 +133,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -151,7 +142,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch.
When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.", "SampleAfterValue": "400009" }, @@ -160,7 +150,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -170,7 +159,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -180,7 +168,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -190,7 +177,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -200,7 +186,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2" @@ -210,7 +195,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -220,7 +204,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -469,7 +452,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -479,7 +461,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events.
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -488,7 +469,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, @@ -497,7 +477,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions", "SampleAfterValue": "2000003", "UMask": "0x2" @@ -506,7 +485,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise-distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -516,7 +494,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json index 2b3b013ccb06..37584a8a75aa 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json @@ -34,7 +34,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -55,85 +55,85 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions.
This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e6 / duration_time", "MetricName": "iio_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1e6 / duration_time", "MetricName": "iio_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read_local",
"ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Percentage of inbound full cacheline writes i= nitiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound full cacheline writes i= nitiated by end device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM / UNC_CHA_TOR_INSE= RTS.IO_ITOM", "MetricName": "io_percent_of_inbound_full_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound partial cacheline write= s initiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound partial cacheline write= s initiated by end device controllers that miss the L3 cache", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR + UNC_CH= A_TOR_INSERTS.IO_MISS_RFO) / (UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR + UNC_CH= A_TOR_INSERTS.IO_RFO)", "MetricName": "io_percent_of_inbound_partial_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound reads initiated by end = device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound reads initiated by end = device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_PCIRDCUR / UNC_CHA_TOR_= INSERTS.IO_PCIRDCUR", "MetricName": "io_percent_of_inbound_reads_that_miss_l3", "ScaleUnit": "100%" @@ -142,14 +142,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - 
"PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -237,25 +237,25 @@ "ScaleUnit": "1ns" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", "MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -285,19 +285,19 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= ).", + "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= )", "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + U= NC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time", "MetricName": 
"memory_extra_write_bw_due_to_directory_updates", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -360,7 +360,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -372,7 +372,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -380,12 +380,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -394,35 +394,125 @@ }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation.
For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks.
Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation.
Covers Core Bound when High ILP as well as when long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments).
Related metrics: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks.
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS.
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -430,24 +520,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -455,7 +545,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 
0.1)", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%" }, @@ -464,45 +554,92 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": 
"100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_= L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMA= ND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD= ))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_= MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_= info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_i= nfo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * tma_info_system_core_fre= quency - 4.4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XS= NP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_gro= up;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L= 3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT= / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD= _L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WIT= H_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -511,17 +648,17 @@ "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_c= lks - tma_pmm_bound if #has_pmem > 0 else MEMORY_ACTIVITY.STALLS_L3_MISS / = tma_info_thread_clks)", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -530,7 +667,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -538,75 +675,73 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMOR= Y_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEM= ORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "81 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEM= AND_RFO.L3_MISS\\,offcore_rsp\\=3D0x103b800002@ + 81 * tma_info_system_core= _frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_gr= oup;tma_frontend_bound_group;tma_issueFB", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. 
Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", - "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. 
This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -615,7 +750,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -624,7 +759,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -632,8 +775,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -641,8 +784,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIR= ED2.VECTOR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6= , tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -650,8 +793,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%" }, { @@ -659,8 +802,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%" }, { @@ -668,39 +811,37 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", - "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", + "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .). Sample with: UOPS= _RETIRED.HEAVY", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", "ScaleUnit": "100%" }, { @@ -708,40 +849,40 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -755,7 +896,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT 
/ (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -769,15 +910,15 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -785,104 +926,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)= ) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full= / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq= _full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_f= b_full / (tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_laten= cy + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) = + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l= 2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_la= tency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma= _memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_= bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_laten= cy / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_lat= ency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1= _hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy= + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_c= ore_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilizat= ion + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization = / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_ope= ration)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_big_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + cpu@RS.EMPTY\\,um= ask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_0) / (tma_amx_busy += tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_mic= rocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_= dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split= _loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_d= ram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tm= a_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + t= ma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) *= tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + t= ma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sha= ring) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bou= nd + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / = (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency = + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tm= a_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -943,11 +990,11 @@ "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -960,20 +1007,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1002,15 +1049,21 @@ "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node." + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1028,7 +1081,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
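
As an aside on the cmask/edge modifiers used throughout these expressions:
cmask=0x1,edge=0x1 counts 0-to-1 transitions of the event, i.e. the number
of distinct stall episodes, so dividing the raw cycle count by it yields an
average cost per episode. A minimal Python sketch of that pattern, using
hypothetical counter values (the numbers are illustrative only):

    # Average DSB->MITE switch cost: total penalty cycles divided by the
    # number of penalty episodes. cmask=0x1,edge=0x1 counts rising edges
    # of the event, i.e. how many times a penalty period began.
    penalty_cycles = 1_200_000    # DSB2MITE_SWITCHES.PENALTY_CYCLES (hypothetical)
    penalty_episodes = 40_000     # same event, cmask=0x1,edge=0x1 (hypothetical)
    print(penalty_cycles / penalty_episodes)  # -> 30.0 cycles per switch
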
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1036,7 +1089,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1044,7 +1097,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1052,7 +1105,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1060,7 +1113,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1068,7 +1121,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). 
Values < = 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1076,7 +1129,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1121,7 +1174,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1131,7 +1184,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1178,7 +1231,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1196,7 +1249,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1238,13 +1291,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1263,12 +1316,12 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": 
"tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1321,26 +1374,33 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) cover= ing for memory attached to local- and remote-socket", - "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / duration_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" }, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / duration_tim= e", + "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW" }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / duratio= n_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). 
R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1369,7 +1429,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1390,18 +1450,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" @@ -1415,7 +1475,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1433,28 +1493,28 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / d= uration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / t= ma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. 
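
The bandwidth metrics in this family all share one shape: one 64-byte cache
line per CAS command or request, scaled to GB and divided by wall-clock
seconds (tma_info_system_time here). A minimal Python sketch with
hypothetical counts:

    CACHE_LINE_BYTES = 64

    def dram_bw_gbps(cas_rd, cas_wr, seconds):
        # 64 bytes move per DRAM CAS command; reads plus writes
        return CACHE_LINE_BYTES * (cas_rd + cas_wr) / 1e9 / seconds

    # Hypothetical: 2e9 read CAS + 1e9 write CAS over a 2-second window
    print(dram_bw_gbps(2e9, 1e9, 2.0))  # -> 96.0 GB/s
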
Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1464,13 +1524,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1481,44 +1542,45 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, + { + "BriefDescription": "Fraction of Uncore cycles where requests got = rejected due to duplicate address already in IRQ ingress queue in the cache= homing agent", + "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", + "MetricGroup": "LockCont;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_mem_irq_duplicate_address", + "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, - { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", - "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC= _CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=3D0x1@ if #has_pme= m > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", - "MetricName": "tma_info_system_mem_pmm_read_latency", - "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. 
Accounts for demand loads and L1/L= 2 data-read prefetches" - }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_read_bw" + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_write_bw" + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1528,10 +1590,17 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "cha_0@event\\=3D0x0@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1540,7 +1609,7 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, @@ -1551,7 +1620,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1560,14 +1629,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": 
"1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1577,13 +1647,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1599,7 +1669,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9" + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", @@ -1607,7 +1685,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", "ScaleUnit": "100%" }, { @@ -1615,8 +1693,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1624,8 +1702,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1633,26 +1711,26 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. 
The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1660,8 +1738,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1669,17 +1756,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", + "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1687,19 +1774,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "DefaultMetricgroupName": "TopdownL2", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1716,7 +1802,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1724,53 +1810,76 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + 
DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", - "MetricGroup": "BadSpec;BvMS;Default;MachineClears;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
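
The new tma_load_stlb_miss_{4k,2m,1g} TopdownL6 children apportion the
parent's walk-cycles fraction by how many completed walks each page size
contributed, per the MetricExprs above. A Python sketch of that split,
with hypothetical counts:

    def split_stlb_miss(parent, walks_4k, walks_2m4m, walks_1g):
        # Weight the parent's cycle fraction by completed-walk counts
        total = walks_4k + walks_2m4m + walks_1g
        return {"4k": parent * walks_4k / total,
                "2m": parent * walks_2m4m / total,
                "1g": parent * walks_1g / total}

    # Hypothetical: tma_load_stlb_miss = 8% of cycles, dominated by 4K walks
    print(split_stlb_miss(0.08, 90_000, 9_000, 1_000))
    # -> {'4k': 0.072, '2m': 0.0072, '1g': 0.0008}
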
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", + "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1778,32 +1887,31 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
-        "DefaultMetricgroupName": "TopdownL2",
-        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_memory_fence",
-        "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
         "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_memory_operations",
@@ -1816,7 +1924,7 @@
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
         "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -1824,8 +1932,8 @@
         "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
@@ -1838,21 +1946,29 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles)",
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
         "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
         "MetricName": "tma_mixing_vectors",
         "MetricThreshold": "tma_mixing_vectors > 0.05",
-        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks / 2",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
-        "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=1\\,edge@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
+        "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=0x1\\,edge\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
     {
@@ -1861,7 +1977,7 @@
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_non_fused_branches",
         "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
         "ScaleUnit": "100%"
     },
     {
@@ -1869,7 +1985,7 @@
         "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_nop_instructions",
-        "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
         "ScaleUnit": "100%"
     },
@@ -1883,19 +1999,19 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).",
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
         "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
         "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
         "MetricName": "tma_other_mispredicts",
-        "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
         "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
         "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
         "MetricName": "tma_other_nukes",
-        "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
@@ -1904,16 +2020,7 @@
         "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
"tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM = * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOA= D_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LO= AD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD= _RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMO= TE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOA= D_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE= _PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (MEMO= RY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks) if 1e6 * (MEM_LOAD_L3_MI= SS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_M= ISS else 0) if #has_pmem > 0 else 0)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%" }, { @@ -1922,7 +2029,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%" }, { @@ -1931,7 +2038,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1940,25 +2047,25 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,um= ask\\=3D0xc@)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.= STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL = + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info= _thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(cpu@RS.EMPTY\= \,umask\\=3D1@ - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (= CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_threa= d_clks", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1966,7 +2073,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1976,8 +2083,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%" }, { @@ -1986,36 +2093,35 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
         "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues",
-        "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group",
         "MetricName": "tma_remote_cache",
-        "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory",
-        "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_remote_mem",
-        "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS",
+        "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
-        "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1;Default",
+        "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
@@ -2024,7 +2130,7 @@
         "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks + tma_c02_wait",
         "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
         "MetricName": "tma_serializing_operation",
-        "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -2033,8 +2139,8 @@
         "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_shuffles_256b",
-        "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers.",
+        "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers",
         "ScaleUnit": "100%"
     },
     {
@@ -2043,7 +2149,7 @@
         "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_slow_pause",
-        "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
         "ScaleUnit": "100%"
     },
@@ -2052,8 +2158,8 @@
         "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%"
     },
     {
@@ -2061,17 +2167,17 @@
         "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
         "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%"
     },
     {
@@ -2079,8 +2185,8 @@
         "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%"
     },
     {
@@ -2088,17 +2194,17 @@
         "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
         "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -2115,7 +2221,7 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -2123,7 +2229,31 @@
         "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -2131,7 +2261,7 @@
         "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-        "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%"
     },
@@ -2140,7 +2270,7 @@
         "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%"
     },
@@ -2149,8 +2279,8 @@
         "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%"
     },
     {
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
index 91013ced74aa..aab082ff9402 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json
@@ -201,7 +201,7 @@
         "PerPkg": "1",
         "PortMask": "0x0001",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7001004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -214,7 +214,7 @@
         "PerPkg": "1",
         "PortMask": "0x0002",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7002004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -227,7 +227,7 @@
         "PerPkg": "1",
         "PortMask": "0x0004",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 2",
-        "UMask": "0x7004004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -240,7 +240,7 @@
         "PerPkg": "1",
         "PortMask": "0x0008",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 3",
-        "UMask": "0x7008004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -253,7 +253,7 @@
         "PerPkg": "1",
         "PortMask": "0x0010",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 4",
-        "UMask": "0x7010004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -266,7 +266,7 @@
         "PerPkg": "1",
         "PortMask": "0x0020",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 5",
-        "UMask": "0x7020004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -279,7 +279,7 @@
         "PerPkg": "1",
         "PortMask": "0x0040",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 6",
-        "UMask": "0x7040004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -292,7 +292,7 @@
         "PerPkg": "1",
         "PortMask": "0x0080",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 7",
-        "UMask": "0x7080004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -316,7 +316,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7000001",
+        "UMask": "0x1",
         "Unit": "IIO"
     },
     {
@@ -329,7 +329,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x4 card is plugged in to slot 1",
-        "UMask": "0x7000002",
+        "UMask": "0x2",
         "Unit": "IIO"
     },
     {
@@ -342,7 +342,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7000004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
     {
@@ -355,7 +355,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -368,7 +368,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x16 card plugged in to stack, Or x8 card plu= gged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7000010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -381,7 +381,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 1", - "UMask": "0x7000020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -394,7 +394,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card i= s plugged in to slot 1", - "UMask": "0x7000040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -431,7 +431,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 c= ard is plugged in to slot 0", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -443,7 +443,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 1", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -455,7 +455,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -467,7 +467,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 3", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -479,7 +479,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 ca= rd is plugged in to slot 0", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -491,7 +491,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. 
= : x4 card is plugged in to slot 1", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -503,7 +503,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +515,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 3", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -662,7 +662,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -675,7 +675,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -688,7 +688,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -701,7 +701,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -714,7 +714,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -727,7 +727,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. 
: x4 card is plugged in to slot 1", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -740,7 +740,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -753,7 +753,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -766,7 +766,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -779,7 +779,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -792,7 +792,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -818,7 +818,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -831,7 +831,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. 
: x4 card is plugged in to slot 1", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -844,7 +844,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -857,7 +857,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -974,7 +974,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1082,7 +1081,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1299,7 +1297,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Passing data= to be written : How often different queues (e.g. channel / fc) ask to send= request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1351,7 +1349,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Request Owne= rship : How often different queues (e.g. channel / fc) ask to send request = into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1364,7 +1362,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Writing line= : How often different queues (e.g. channel / fc) ask to send request into = pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1377,7 +1375,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Pass= ing data to be written : How often different queues (e.g. channel / fc) are= allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1390,7 +1388,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing final read or write of line : How often different queues (e.g. channel = / fc) are allowed to send request into pipeline", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1403,7 +1401,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Proc= essing response from IOMMU : How often different queues (e.g. channel / fc)= are allowed to send request into pipeline", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1416,7 +1414,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing to IOMMU : How often different queues (e.g. 
channel / fc) are allowed t= o send request into pipeline", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1429,7 +1427,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Requ= est Ownership : How often different queues (e.g. channel / fc) are allowed = to send request into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1442,7 +1440,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Writ= ing line : How often different queues (e.g. channel / fc) are allowed to se= nd request into pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1962,7 +1960,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : PCIe Request complete : = Only for posted requests : Each PCIe request is broken down into a series o= f cacheline granular requests and each cacheline size request may need to m= ake multiple passes through the pipeline (e.g. for posted interrupts or mul= ti-cast). Each time a single PCIe request completes all its cacheline gra= nular requests, it advances pointer.", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1975,7 +1973,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Writing line : Only for = posted requests : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1988,7 +1986,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Issuing final read or wr= ite of line : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2001,7 +1999,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Passing data to be writt= en : Only for posted requests : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2014,7 +2012,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Passing dat= a to be written : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -2026,7 +2024,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2039,7 +2037,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Request Own= ership : Only for posted requests", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -2052,7 +2050,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Writing lin= e : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2065,7 +2063,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Passing data = to be written : Each PCIe request is broken down into a series of cacheline= granular requests and each cacheline size request may need to make multipl= e passes through the pipeline (e.g. for posted interrupts or multi-cast). = Each time a cacheline completes a single pass (e.g. 
posts a write to singl= e multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -2091,7 +2089,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Request Owner= ship : Each PCIe request is broken down into a series of cacheline granular= requests and each cacheline size request may need to make multiple passes = through the pipeline (e.g. for posted interrupts or multi-cast). Each tim= e a cacheline completes a single pass (e.g. posts a write to single multi-c= ast target) it advances state : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2104,7 +2102,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Writing line = : Each PCIe request is broken down into a series of cacheline granular requ= ests and each cacheline size request may need to make multiple passes throu= gh the pipeline (e.g. for posted interrupts or multi-cast). Each time a c= acheline completes a single pass (e.g. posts a write to single multi-cast t= arget) it advances state : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2309,7 +2307,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane= 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2322,7 +2320,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2335,7 +2333,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 2= ", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2348,7 +2346,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2361,7 +2359,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 4/5/6/7, Or x8 card plugged in to Lane= 4/5, Or x4 card is plugged in to slot 4", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2374,7 +2372,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. 
: Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 5", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2387,7 +2385,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x8 card plugged in to Lane 6/7, Or x4 card is plugged in to slot 6= ", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2400,7 +2398,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 7", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BD64919F41A for ; Mon, 9 Dec 2024 22:28:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783350; cv=none; b=a4I5mCzog3uyfHyyF/+i2A4AHyPCeSumusDVx5ZpEYOthKpMAMxDzlND7YTE3JFJl1vMGYdigAoAREu3/pD7ezdTvVTWJviOb9YD1+obGvkoKCGbR3uV2astMXIEHr4avtmxLDCH2vIWTh/v61xRN0ifw8G3tvsLA7BcWq9HyB0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783350; c=relaxed/simple; bh=8WIuanVlqzZOJzIR7UvB84gyYI26gJ+0VZ4PMKtaREA=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=LoXMZw4zTZlobAz/07/xrR+TMWh+qbj3R8G9l/lnGsOmFq60gay9U3gesbOx6cgfjou0McJ8POc/hFszJU1mPKDwrqjSYluIrax3f5u6m3BadHjW0wnmJhF3J5BMOUfByhfCPBx/NNe6EObofdz7uQwlJe0FUoIIBJ+Oi/wtHKQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=sj59qVTt; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="sj59qVTt" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6eff377fdc1so43219897b3.2 for ; Mon, 09 Dec 2024 14:28:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1733783334; x=1734388134; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=EPHtsyhzspSxjeT7DwnjX3IDFtDYgUQJABCDVWuBhoU=; b=sj59qVTtTzxN278zGfWTMhgJbYfDsgO14tWZ06E26GQ/0/y6XBYrQv66+KCW0YC4XJ TImejRI5PTtMo/+wcTkIobImYec4l9a81puJr0MEVhaqt+v2HkBt/QFLhg0O5pzyh9aj QO0vqOqd6hwdWgn+igIVEJc5VKsgBQuDvwpMvCKtI3Q7bRLL385vaF06Pk5SvVUgjp/e NzRsDLYzHv6ejNeO7KBYN8XOuZhE84Z8EBPhwhtEGMwShlS0pb8AtHxOngLID86rGAJv 
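A note for readers decoding the hunks above: the old IIO UMask values appear to fold the FCMask and PortMask fields, which the JSON already carries separately, into the upper bits of the umask, and the update keeps only the low umask bits. The field positions below are inferred from the values themselves rather than stated anywhere in the patch; a minimal Python sketch of the apparent layout:

def split_legacy_iio_umask(value: int) -> dict:
    # Inferred layout: fc << 24 | port << 12 | umask, e.g.
    # 0x7001004 -> fc 0x07, port 0x001, umask 0x4.
    return {
        "fc_mask": (value >> 24) & 0xff,     # duplicates the "FCMask" JSON field
        "port_mask": (value >> 12) & 0xfff,  # duplicates the "PortMask" JSON field
        "umask": value & 0xfff,              # what the patch keeps in "UMask"
    }

assert split_legacy_iio_umask(0x7001004) == {"fc_mask": 0x07, "port_mask": 0x001, "umask": 0x4}
assert split_legacy_iio_umask(0x70ff020) == {"fc_mask": 0x07, "port_mask": 0x0ff, "umask": 0x20}

With the redundant copies dropped, UMask carries only the event-specific bits, and FCMask/PortMask remain the single source of truth for the filter programming.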
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:56 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-20-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 19/22] perf vendor events: Update Sierraforest events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.04 to v1.07. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.07:
https://github.com/intel/perfmon/commit/903b3d0a0a61bb6064013db9eb4c26457dacfea6
https://github.com/intel/perfmon/commit/825c4361473e676119b51f04c7896a8cfa8a5ea5
https://github.com/intel/perfmon/commit/bafe6a7b5cbee92c31ec19dfcefd6dcc243e4e8a

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 .../arch/x86/sierraforest/counter.json        |   2 +-
 .../arch/x86/sierraforest/other.json          |  20 ++
 .../arch/x86/sierraforest/pipeline.json       |   3 +-
 .../arch/x86/sierraforest/srf-metrics.json    | 308 ++++++------------
 .../arch/x86/sierraforest/uncore-cache.json   |  28 +-
 .../x86/sierraforest/uncore-interconnect.json |  87 +++++
 .../arch/x86/sierraforest/uncore-io.json      | 280 ++++++++--------
 .../arch/x86/sierraforest/uncore-memory.json  | 122 ++++++-
 .../arch/x86/sierraforest/uncore-power.json   |  98 ++++++
 10 files changed, 581 insertions(+), 369 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 2b0414f4b7aa..70497b8c65a2 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -28,7 +28,7 @@ GenuineIntel-6-2E,v4,nehalemex,core
 GenuineIntel-6-A7,v1.04,rocketlake,core
 GenuineIntel-6-2A,v19,sandybridge,core
 GenuineIntel-6-8F,v1.25,sapphirerapids,core
-GenuineIntel-6-AF,v1.04,sierraforest,core
+GenuineIntel-6-AF,v1.07,sierraforest,core
 GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
 GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core
 GenuineIntel-6-55-[01234],v1.35,skylakex,core
diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/counter.json b/tools/perf/pmu-events/arch/x86/sierraforest/counter.json
index e57e3bf98b2a..c62a147e360b 100644
--- a/tools/perf/pmu-events/arch/x86/sierraforest/counter.json
+++ b/tools/perf/pmu-events/arch/x86/sierraforest/counter.json
@@ -52,7 +52,7 @@
     {
         "Unit": "PCU",
         "CountersNumFixed": "0",
-        "CountersNumGeneric": 4
+        "CountersNumGeneric": "4"
     },
     {
         "Unit": "CHACMS",
diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/other.json b/tools/perf/pmu-events/arch/x86/sierraforest/other.json
index 28f9a4c3ea84..4c77dac8ec78 100644
--- a/tools/perf/pmu-events/arch/x86/sierraforest/other.json
+++ b/tools/perf/pmu-events/arch/x86/sierraforest/other.json
@@ -18,6 +18,26 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to this socket.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x184000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to another socket.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x730000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
         "Counter": "0,1,2,3,4,5,6,7",
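As a quick sanity check on the two OCR events added above (my illustration, not part of the patch): offcore-response MSR encodings select the request type in the low bits and the response/supplier in the upper bits, so both DRAM events should request only demand data reads and differ in their supplier selection. The 16-bit request-field width used below is an assumption:

LOCAL_DRAM = 0x184000001   # OCR.DEMAND_DATA_RD.LOCAL_DRAM, from the hunk above
REMOTE_DRAM = 0x730000001  # OCR.DEMAND_DATA_RD.REMOTE_DRAM

REQUEST_MASK = 0xffff  # assumed width of the request-type field
assert LOCAL_DRAM & REQUEST_MASK == 0x1   # demand data read only
assert REMOTE_DRAM & REQUEST_MASK == 0x1
print(hex(LOCAL_DRAM >> 16), hex(REMOTE_DRAM >> 16))  # differing supplier bits: 0x18400 0x73000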
diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json b/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json
index b67c0c89054d..40fa4f5ae261 100644
--- a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json
@@ -1,6 +1,6 @@
 [
     {
-        "BriefDescription": "Counts the number of cycles when any of the dividers are active.",
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
         "Counter": "0,1,2,3,4,5,6,7",
         "CounterMask": "1",
         "EventCode": "0xcd",
@@ -185,7 +185,6 @@
         "BriefDescription": "Fixed Counter: Counts the number of instructions retired",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json b/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json
index b881b1958f11..83c86afd2960 100644
--- a/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json
@@ -1,4 +1,11 @@
 [
+    {
+        "BriefDescription": "C10 residency percent per package",
+        "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C10_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "C1 residency percent per core",
         "MetricExpr": "cstate_core@c1\\-residency@ / TSC",
@@ -7,17 +14,24 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "C6 residency percent per core",
-        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "BriefDescription": "C2 residency percent per package",
+        "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
         "MetricGroup": "Power",
-        "MetricName": "C6_Core_Residency",
+        "MetricName": "C2_Pkg_Residency",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "C6 residency percent per module",
-        "MetricExpr": "cstate_module@c6\\-residency@ / TSC",
+        "BriefDescription": "C3 residency percent per package",
+        "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
         "MetricGroup": "Power",
-        "MetricName": "C6_Module_Residency",
+        "MetricName": "C3_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per core",
+        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Core_Residency",
         "ScaleUnit": "100%"
     },
     {
@@ -28,7 +42,21 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "C7 residency percent per core",
+        "MetricExpr": "cstate_core@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C8 residency percent per package",
+        "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C8_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
         "ScaleUnit": "1per_instr"
@@ -49,67 +77,67 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_load_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions",
         "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "dtlb_2nd_level_store_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth observed by the integrated I/O traffic contoller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth observed by the integrated I/O traffic contoller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e6 / duration_time",
         "MetricName": "iio_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU",
        "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1e6 / duration_time",
         "MetricName": "iio_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read_local",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket",
         "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read_remote",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write_local",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write_remote",
         "ScaleUnit": "1MB/s"
@@ -118,14 +146,14 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "itlb_2nd_level_large_page_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "itlb_2nd_level_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -177,25 +205,25 @@
         "ScaleUnit": "1ns"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -225,13 +253,13 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
"TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", "MetricExpr": "TOPDOWN_BE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -278,104 +304,92 @@ { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", "MetricExpr": "TOPDOWN_BAD_SPECULATION.ALL_P / (6 * CPU_CLK_UNHALT= ED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", - "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS)", "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (6 * CPU_CLK_UNHALTED.CORE)= ", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { 
"BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls", "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (6 * CPU_CLK_UNH= ALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls", "MetricExpr": "TOPDOWN_FE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses", "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (6 * CPU_CLK_UN= HALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -403,32 +417,6 @@ "MetricGroup": "Flops", "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp" }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "tma_info_bottleneck_dtlb_miss_bound_cycles", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "tma_info_bottleneck_ifetch_miss_bound_cycles", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "tma_info_bottleneck_load_miss_bound_cycles", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "tma_info_bottleneck_mem_exec_bound_cycles", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. 
See Info.Mem_Exec_Bound" - }, { "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", "MetricExpr": "100 * (LD_HEAD.DTLB_MISS_AT_RET + LD_HEAD.PGWALK_AT= _RET) / CPU_CLK_UNHALTED.CORE", @@ -510,21 +498,6 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio" }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "tma_info_buffer_stalls_load_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "tma_info_buffer_stalls_mem_rsv_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "tma_info_buffer_stalls_store_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles" - }, { "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.LD_BUF / CPU_CLK_UNHALTED= .CORE", @@ -546,7 +519,8 @@ { "BriefDescription": "Cycles Per Instruction", "MetricExpr": "CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY", - "MetricName": "tma_info_core_cpi" + "MetricName": "tma_info_core_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Floating Point Operations Per Cycle", @@ -564,21 +538,6 @@ "MetricExpr": "TOPDOWN_RETIRING.ALL_P / INST_RETIRED.ANY", "MetricName": "tma_info_core_upi" }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l2h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.LLC_MISS / MEM_BOUND_= STALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss" - }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.L2_HIT / MEM_BOUND_ST= ALLS_IFETCH.ALL", @@ -591,24 +550,6 @@ "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l2hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l3hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled 
due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.LLC_MISS / MEM_BOUND_ST= ALLS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s" - }, { "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.L2_HIT / MEM_BOUND_STAL= LS_LOAD.ALL", @@ -656,16 +597,6 @@ "MetricExpr": "1e3 * MACHINE_CLEARS.SMC / INST_RETIRED.ANY", "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki" }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_adressaliasing", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g" - }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_storefwdblk", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk" - }, { "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", "MetricExpr": "100 * LD_BLOCKS.ADDRESS_ALIAS / MEM_UOPS_RETIRED.AL= L_LOADS", @@ -678,31 +609,6 @@ "MetricName": "tma_info_mem_exec_blocks_loads_with_storefwdblk", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_l1miss", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_otherpipeline= blks", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_pagewalk", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_stlbhit", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_storefwding", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding" - }, { "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", "MetricExpr": "100 * LD_HEAD.L1_MISS_AT_RET / LD_HEAD.ANY_AT_RET", @@ -758,11 +664,6 @@ "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / TOPDOWN_RETIRING= .ALL_P", "MetricName": "tma_info_mem_mix_memload_ratio" }, - { - "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "tma_info_serialization_tpause_cycles", - "MetricName": "tma_info_serialization _%_tpause_cycles" - }, { "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", "MetricExpr": "100 * SERIALIZATION.C01_MS_SCB / (6 * CPU_CLK_UNHAL= TED.CORE)", @@ -783,14 +684,17 @@ }, { "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE_P@k / CPU_CLK_UNHALTED.CO= RE", - 
"MetricGroup": "Summary", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P:k / CPU_CLK_UNHALTED.CORE", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization" }, { @@ -814,102 +718,90 @@ "MetricName": "tma_info_uop_mix_x87_uop_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses", "MetricExpr": "TOPDOWN_FE_BOUND.ITLB_MISS / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (6 * CPU_C= LK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (6 * CPU_CLK_UNHALTE= D.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": 
"Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized", "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes", "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (6 * CPU_CLK_UNHALTED.C= ORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that result = in retirement slots", "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by 
the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json = b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json index f37107373e3b..4406b7f28c67 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json @@ -1066,7 +1066,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Code read from local IA that miss the cache", + "BriefDescription": "Code read from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD", @@ -1086,7 +1086,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt from local IA that miss the cac= he", + "BriefDescription": "Data read opt from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT", @@ -1096,7 +1096,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt prefetch from local IA that mis= s the cache", + "BriefDescription": "Data read opt prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT_PREF", @@ -1400,7 +1400,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt_Prefs issued by iA Cor= es that missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt prefetch from lo= cal iA that missed the LLC targeting local memory", "UMask": "0xc8a6fe01", "Unit": "CHA" }, @@ -1410,7 +1410,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt_Prefs issued by iA Cor= es that missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt prefetch from lo= cal iA that missed the LLC targeting remote memory", "UMask": "0xc8a77e01", "Unit": "CHA" }, @@ -1420,7 +1420,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt issued by iA Cores tha= t missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt from local iA th= at missed the LLC targeting remote memory", "UMask": "0xc8277e01", "Unit": "CHA" }, @@ -1595,7 +1595,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_LOCAL", @@ -1624,7 +1624,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_LOCAL", @@ -1634,7 +1634,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read 
for ownership prefetch from local IA tha= t miss the LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_REMOTE", @@ -1644,7 +1644,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_REMOTE", @@ -1736,7 +1736,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO", @@ -1746,7 +1746,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO_PREF", @@ -2442,7 +2442,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD_OPT", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opts issued by iA Cores = that hit the LLC", "UMask": "0xc827fd01", @@ -2453,7 +2452,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD_OPT_PREF", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt_Prefs issued by iA C= ores that hit the LLC", "UMask": "0xc8a7fd01", @@ -2663,7 +2661,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_OPT", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt issued by iA Cores t= hat missed the LLC", "UMask": "0xc827fe01", @@ -2674,7 +2671,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_OPT_PREF", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt_Prefs issued by iA C= ores that missed the LLC", "UMask": "0xc8a7fe01", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnec= t.json b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.js= on index 80440edac431..2ccbc8bca24e 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.json @@ -814,6 +814,26 @@ "PerPkg": "1", "Unit": "IRP" }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REJ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IRP" + }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "IRP" + }, { "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pu= lled away ownership before a write was committed", "Counter": "0,1,2,3", @@ -824,6 +844,46 @@ "UMask": "0x10", "Unit": "IRP" }, + { + "BriefDescription": "Snoop Hit E/S responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x74", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit I responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": 
"UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", @@ -1196,6 +1256,33 @@ "UMask": "0x4", "Unit": "UPI" }, + { + "BriefDescription": "Cycles in L0p", + "Counter": "0,1,2,3", + "EventCode": "0x27", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Counter": "0,1,2,3", + "EventCode": "0x28", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Counter": "0,1,2,3", + "EventCode": "0x29", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, { "BriefDescription": "Matches on Transmit path of a UPI Port : Non-= Coherent Bypass", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json b/t= ools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json index cffb9d94b53d..886b99a971be 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - 
"UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -233,7 +233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -245,7 +245,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -257,7 +257,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -269,7 +269,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -281,7 +281,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -293,7 +293,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -305,7 +305,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -317,7 +317,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -329,7 +329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -341,7 +341,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -352,7 +352,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -363,7 +363,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -374,7 +374,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -385,7 +385,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -396,7 +396,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -418,7 +418,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -429,7 +429,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -440,7 +440,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -451,7 +451,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -462,7 +462,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", 
"Unit": "IIO" }, { @@ -473,7 +473,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -484,7 +484,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -495,7 +495,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -506,7 +506,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -517,7 +517,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -528,7 +528,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -539,7 +539,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -550,7 +550,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -561,7 +561,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -572,7 +572,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -583,7 +583,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -594,7 +594,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -605,7 +605,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -616,7 +616,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -638,7 +638,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -649,7 +649,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -661,7 +661,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -673,7 +673,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -685,7 +685,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -697,7 +697,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -709,7 +709,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -721,7 +721,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -733,7 +733,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -745,7 +745,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -757,7 +757,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -769,7 +769,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -781,7 +781,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -793,7 +793,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -817,7 +817,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -829,7 +829,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -841,7 +841,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -853,7 +853,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1129,7 +1129,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1141,7 +1141,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1153,7 +1153,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1165,7 +1165,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1177,7 +1177,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1201,7 +1201,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1213,7 +1213,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -1225,7 +1225,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -1237,7 +1237,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1249,7 +1249,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1261,7 +1261,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1273,7 +1273,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1285,7 +1285,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1297,7 +1297,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1318,7 +1318,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1329,7 +1329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1340,7 +1340,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1351,7 +1351,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1373,7 +1373,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1384,7 +1384,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1395,7 +1395,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1406,7 +1406,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1417,7 +1417,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1428,7 +1428,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1439,7 +1439,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1450,7 +1450,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1461,7 +1461,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1472,7 +1472,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1483,7 +1483,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1494,7 +1494,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1505,7 +1505,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1516,7 +1516,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1527,7 +1527,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1538,7 +1538,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1549,7 +1549,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1560,7 +1560,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1571,7 +1571,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1582,7 +1582,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1593,7 +1593,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1604,7 +1604,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1615,7 +1615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1626,7 +1626,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1637,7 +1637,7 @@ "FCMask": 
"0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1648,7 +1648,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1659,7 +1659,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1670,7 +1670,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1681,7 +1681,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1692,7 +1692,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1703,7 +1703,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1715,7 +1715,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1727,7 +1727,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1739,7 +1739,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1751,7 +1751,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1763,7 +1763,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1775,7 +1775,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1787,7 +1787,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1799,7 +1799,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1811,7 +1811,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1823,7 +1823,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1835,7 +1835,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1847,7 +1847,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1859,7 +1859,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1871,7 +1871,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1883,7 +1883,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1895,7 +1895,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json= b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json index 7e6e6764f181..ae9c62b32e92 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is 
enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. DCLK is 1/4 of DRAM data rate.", "Counter": "0,1,2,3", "EventCode": "0x01", "EventName": "UNC_M_CLOCKTICKS", @@ -188,6 +188,104 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x10", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x20", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x40", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x80", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e and all pages are closed", + "Counter": "0,1,2,3", + "EventCode": "0x88", + "EventName": "UNC_M_POWER_CHANNEL_PPD_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number = of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -360,6 +458,28 @@ "PerPkg": "1", "Unit": "IMC" }, + { + "BriefDescription": "subevent0 - # of cycles all ranks were in SR = subevent1 - # of times all ranks went into SR subevent2 -# of times ps_sr_= active asserted (SRE) subevent3 - # of times ps_sr_active deasserted (SRX) = subevent4 - # of times PS-&>Refresh ps_sr_req asserted (SRE) subevent5 - # = of times PS-&>Refresh ps_sr_req deasserted (SRX) subevent6 - # of cycles PS= Ctrlr FSM was in FATAL", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles all ranks were in SR", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json = b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json index 02e59f64a544..9ea852ef190e 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json @@ -7,5 +7,103 @@ "PerPkg": "1", "PublicDescription": "PCU Clockticks: The PCU runs off a fixed 1 = GHz clock. This event counts the number of pclk cycles measured while the = counter was enabled. The pclk, like the Memory Controller's dclk, counts a= t a constant rate making it a good measure of actual wall time.", "Unit": "PCU" + }, + { + "BriefDescription": "Thermal Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x04", + "EventName": "UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Thermal Strongest Upper Limit Cycles : Numbe= r of cycles any frequency is reduced due to a thermal limit. Count only if= throttling is occurring.", + "Unit": "PCU" + }, + { + "BriefDescription": "Power Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x05", + "EventName": "UNC_P_FREQ_MAX_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Power Strongest Upper Limit Cycles : Counts = the number of cycles when power is the upper limit on frequency.", + "Unit": "PCU" + }, + { + "BriefDescription": "Cycles spent changing Frequency", + "Counter": "0,1,2,3", + "EventCode": "0x74", + "EventName": "UNC_P_FREQ_TRANS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Cycles spent changing Frequency : Counts the= number of cycles when the system is changing frequency. This can not be f= iltered by thread ID. One can also use it with the occupancy counter that = monitors number of threads in C0 to estimate the performance impact that fr= equency transitions had on the system.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C2E", + "Counter": "0,1,2,3", + "EventCode": "0x2b", + "EventName": "UNC_P_PKG_RESIDENCY_C2E_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C2E : Counts the= number of cycles when the package was in C2E. 
This event can be used in c= onjunction with edge detect to count C2E entrances (or exits using invert).= Residency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C6", + "Counter": "0,1,2,3", + "EventCode": "0x2d", + "EventName": "UNC_P_PKG_RESIDENCY_C6_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C6 : Counts the = number of cycles when the package was in C6. This event can be used in con= junction with edge detect to count C6 entrances (or exits using invert). R= esidency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C0", + "Counter": "0,1,2,3", + "EventCode": "0x35", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C0", + "PerPkg": "1", + "PublicDescription": "Number of cores in C0 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C3", + "Counter": "0,1,2,3", + "EventCode": "0x36", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Number of cores in C3 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C6", + "Counter": "0,1,2,3", + "EventCode": "0x37", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C6", + "PerPkg": "1", + "PublicDescription": "Number of cores in C6 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "External Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x0a", + "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "External Prochot : Counts the number of cycl= es that we are in external PROCHOT mode. This mode is triggered when a sen= sor off the die determines that something off-die (like DRAM) is too hot an= d must throttle to avoid damaging the chip.", + "Unit": "PCU" + }, + { + "BriefDescription": "Internal Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x09", + "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Internal Prochot : Counts the number of cycl= es that we are in Internal PROCHOT mode. 
This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.",
+        "Unit": "PCU"
+    }
 ]
-- 
2.47.1.545.g3c1d2e2a6a-goog

From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:57 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-21-irogers@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20241209222800.296000-1-irogers@google.com>
X-Mailer: git-send-email 2.47.1.545.g3c1d2e2a6a-goog
Subject: [PATCH v1 20/22] perf vendor events: Update Skylake metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update TMA metrics from 4.8 to 5.01.

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/skylake/metricgroups.json        |   9 +-
 .../arch/x86/skylake/skl-metrics.json         | 684 ++++++++++--------
 2 files changed, 406 insertions(+), 287 deletions(-)
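A quick schema note for reviewers: each entry in these metric JSON files
is a self-contained definition that the jevents generator under
tools/perf/pmu-events compiles into perf's built-in metric tables. A
minimal sketch of the entry shape follows; the field names match the
files touched below, but the values here are illustrative, not copied
from any one hunk:

    {
        "BriefDescription": "Example: loads aliased by stores 4K apart",
        "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
        "MetricName": "tma_4k_aliasing",
        "MetricThreshold": "tma_4k_aliasing > 0.2",
        "ScaleUnit": "100%"
    }

"MetricExpr" is evaluated over the listed events, and "MetricThreshold"
marks when the resulting value is considered significant. Once perf is
rebuilt with the updated tables, a metric can normally be requested by
name, e.g. "perf stat -M tma_4k_aliasing -- <workload>".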
diff --git a/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json b/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
index 3a88260194d1..00172bff5219 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
@@ -37,6 +37,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -83,7 +84,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -112,10 +115,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
     "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
@@ -128,5 +134,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
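Much of the skl-metrics.json churn below is mechanical and falls into
two patterns: trailing periods are dropped from description strings,
and redundant parentheses are dropped from MetricThreshold expressions.
Since the chained "&" is associative, the rewritten thresholds should
evaluate identically; an illustrative before/after, taken from the
tma_4k_aliasing hunk:

    old: "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))"
    new: "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2"

Beyond that, the update introduces new tma_bottleneck_* summary metrics;
see the additions in the hunks that follow.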
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -91,7 +91,7 @@ "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info= _thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, @@ -102,7 +102,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -112,9 +112,102 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. 
This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_= store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores += tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_b= ranches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tm= a_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers = + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_i= tlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_o= ther_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", "MetricConstraint": "NO_GROUP_EVENTS", @@ -123,7 +216,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -131,8 +224,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -140,8 +233,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -149,18 +242,50 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(18.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_HIT_RETIRED.XSNP_HITM + 16.5 * tma_info_system_core_frequency * MEM_LOAD_= L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED= .L1_MISS / 2) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((22 * tma_info_system_core_frequency - 3.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (20 * tma_= info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LO= AD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - 
"MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -171,25 +296,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "16.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(20 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -198,7 +323,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -208,8 +333,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -218,7 +343,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -226,47 +351,47 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "22 * tma_info_system_core_frequency * OFFCORE_RESPO= NSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related m= etrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_= data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", "ScaleUnit": "100%" }, { @@ -276,7 +401,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -286,17 +411,17 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -306,7 +431,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -315,7 +440,7 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", "ScaleUnit": "100%" }, { @@ -323,8 +448,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -333,8 +458,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -342,8 +467,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -351,8 +476,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -362,50 +487,50 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_thread_slots",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
-        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_thread_clks",
+        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -415,7 +540,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
@@ -430,8 +555,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
@@ -439,7 +564,7 @@
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -447,108 +572,10 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "tma_info_bottleneck_big_code",
-        "MetricThreshold": "tma_info_bottleneck_big_code > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
-        "MetricGroup": "BvBO;Ret",
-        "MetricName": "tma_info_bottleneck_branching_overhead",
-        "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5",
-        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
-        "MetricName": "tma_info_bottleneck_cache_memory_bandwidth",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
-        "MetricName": "tma_info_bottleneck_cache_memory_latency",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
-        "MetricGroup": "BvCB;Cor;tma_issueComp",
-        "MetricName": "tma_info_bottleneck_compute_bound_est",
-        "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code",
-        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
-        "MetricName": "tma_info_bottleneck_instruction_fetch_bw",
-        "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
-        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
-        "MetricName": "tma_info_bottleneck_irregular_overhead",
-        "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10",
-        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))",
-        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_data_tlbs",
-        "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20",
-        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_synchronization"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
-        "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_synchronization",
-        "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 10",
-        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
-        "MetricName": "tma_info_bottleneck_mispredictions",
-        "MetricThreshold": "tma_info_bottleneck_mispredictions > 20",
-        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
-        "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)",
-        "MetricGroup": "BvOB;Cor;Offcore",
-        "MetricName": "tma_info_bottleneck_other_bottlenecks",
-        "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20",
-        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls."
-    },
-    {
-        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.",
-        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "BvUW;Ret",
-        "MetricName": "tma_info_bottleneck_useful_work",
-        "MetricThreshold": "tma_info_bottleneck_useful_work > 20"
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
@@ -577,7 +604,7 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))",
+        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_core_core_clks"
@@ -605,11 +632,11 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -622,20 +649,20 @@
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@ + 2",
+        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@ + 2",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency"
     },
@@ -665,7 +692,13 @@
         "MetricName": "tma_info_frontend_l2mpki_code_all"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
@@ -684,7 +717,7 @@
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
    },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -692,7 +725,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -700,7 +733,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -708,7 +741,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
@@ -716,7 +749,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
@@ -756,7 +789,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100"
     },
     {
@@ -766,7 +799,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -801,7 +834,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -819,7 +852,7 @@
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -861,13 +894,13 @@
     {
         "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW;Offcore",
         "MetricName": "tma_info_memory_l3_cache_access_bw"
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -886,7 +919,7 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
@@ -942,7 +975,7 @@
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@)",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
     },
@@ -963,18 +996,18 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.ANY)",
         "MetricGroup": "MicroSeq;Pipeline;Ret;Retire",
         "MetricName": "tma_info_pipeline_ipassist",
-        "MetricThreshold": "tma_info_pipeline_ipassist < 100e3",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
         "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "Pipeline;Ret",
         "MetricName": "tma_info_pipeline_retire"
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -992,15 +1025,15 @@
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3",
+        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / tma_info_system_time / 1e3",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
-        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full"
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
@@ -1010,13 +1043,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -1027,18 +1061,31 @@
     },
     {
         "BriefDescription": "Average number of parallel data read requests to external memory",
-        "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUPANCY.DATA_READ@cmask\\=1@",
+        "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUPANCY.DATA_READ@cmask\\=0x1@",
         "MetricGroup": "Mem;MemoryBW;SoC",
         "MetricName": "tma_info_system_mem_parallel_reads",
         "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
     },
     {
         "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)",
-        "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_REQUESTS.DATA_READ) / (tma_info_system_socket_clks / duration_time)",
+        "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_REQUESTS.DATA_READ) / (tma_info_system_socket_clks / tma_info_system_time)",
         "MetricGroup": "Mem;MemoryLat;SoC",
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
+    },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
         "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)",
@@ -1051,6 +1098,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -1058,7 +1112,7 @@
         "MetricName": "tma_info_system_turbo_utilization"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1067,14 +1121,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1100,43 +1155,52 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
+        "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1144,17 +1208,17 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "6.5 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(10 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_bottleneck_cache_memory_latency, tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1162,18 +1226,18 @@
         "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -1191,7 +1255,7 @@
         "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_hit",
-        "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1199,15 +1263,39 @@
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_miss",
-        "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (9 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -1219,16 +1307,16 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
     {
@@ -1236,8 +1324,8 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1248,11 +1336,11 @@
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
         "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_memory_operations",
@@ -1266,7 +1354,7 @@
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
         "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -1274,8 +1362,8 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1288,12 +1376,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1301,8 +1389,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. 
Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1311,7 +1399,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1319,7 +1407,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1333,19 +1421,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1354,7 +1442,7 @@ "MetricGroup": 
"Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1363,7 +1451,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -1399,7 +1487,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1408,7 +1496,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1425,8 +1513,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1434,8 +1522,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1443,7 +1531,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1452,16 +1540,16 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1479,7 +1567,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1489,8 +1577,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1498,17 +1586,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. 
Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1516,8 +1604,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1525,18 +1613,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / M= EM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1552,7 +1640,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1560,7 +1648,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1568,7 +1680,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": 
"tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1577,8 +1689,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { --=20 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A60D1A08AB for ; Mon, 9 Dec 2024 22:29:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783372; cv=none; b=huYah8TFMzoRWjUV0lluafTbNoZZ4s544Yv/1MzCIZZuBpXL6HQfjCggoRrVMdsxJAqInpaTetzyYkOrl6lBWw9N6RO5X4lII5vBmPsWsOasNfPhshBMEgZhqP5UJET0bjtAhU75U0CWhnt6tFMawzqe+uJqP2HdhOnU1XWhcgA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1733783372; c=relaxed/simple; bh=8AU30XDpIPBmpM9jT4rv635xj7eNMgdVeu40Y4ihlP4=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=OmCFjXhGIn1ke4iYVGUbq9kRVsKHeHr4Vxu7Ic8e05xttKgfjpw7Aj2iYXZEYX2gOY/EAxgfAicLaQM8KPBUqP+E46bI5LdW7tNL1umIkypWvp49o3Tr6uAEswoZPGnhB/ip3rpofib7ZaWgxLnqDyZ6rey7OSA9VogE0skcVhY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=0dNXWvQb; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="0dNXWvQb" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6ef6e33c182so60187377b3.2 for ; Mon, 09 Dec 2024 14:29:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; 
From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:58 -0800
In-Reply-To: <20241209222800.296000-1-irogers@google.com>
Message-Id: <20241209222800.296000-22-irogers@google.com>
References: <20241209222800.296000-1-irogers@google.com>
Subject: [PATCH v1 21/22] perf vendor events: Update SkylakeX events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim ,
 Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter ,
 Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang ,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan

Update events from v1.35 to v1.36. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.36: https://github.com/intel/perfmon/commit/f6801e5c145406f355f40e1746f836eaa14= 26cf9 The TMA 5.01 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/skylakex/metricgroups.json | 9 +- .../arch/x86/skylakex/skx-metrics.json | 740 ++++++++++-------- .../arch/x86/skylakex/uncore-cache.json | 60 +- .../x86/skylakex/uncore-interconnect.json | 11 - 5 files changed, 465 insertions(+), 357 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 70497b8c65a2..e41bccc625af 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -31,7 +31,7 @@ GenuineIntel-6-8F,v1.25,sapphirerapids,core GenuineIntel-6-AF,v1.07,sierraforest,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core -GenuineIntel-6-55-[01234],v1.35,skylakex,core +GenuineIntel-6-55-[01234],v1.36,skylakex,core GenuineIntel-6-86,v1.23,snowridgex,core GenuineIntel-6-8[CD],v1.16,tigerlake,core GenuineIntel-6-2C,v5,westmereep-dp,core diff --git a/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/skylakex/metricgroups.json index cccfcab3425e..a579603f720b 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json @@ -38,6 +38,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -84,7 +85,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -113,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to 
tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/too= ls/perf/pmu-events/arch/x86/skylakex/skx-metrics.json index e5e86892d7bb..2fe630cd4927 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,31 +76,31 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. 
This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO= _PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART= 2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -109,14 +109,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -192,25 +192,25 @@
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_local_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.",
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory",
         "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time",
         "MetricName": "llc_miss_remote_memory_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -240,13 +240,13 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ + cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40432@ + cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=0x40431@)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
@@ -295,12 +295,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -312,7 +312,7 @@
         "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -323,7 +323,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -333,9 +333,102 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
         "MetricConstraint": "NO_GROUP_EVENTS",
@@ -344,7 +437,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -352,8 +445,8 @@
         "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -361,8 +454,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -370,18 +463,50 @@
         "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 44 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -392,25 +517,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
     {
@@ -419,7 +544,7 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
@@ -429,8 +554,8 @@
         "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -439,7 +564,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -447,47 +572,47 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(110 * tma_info_system_core_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.REMOTE_HITM + OFFCORE_RESPONSE.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_system_core_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -497,7 +622,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -507,17 +632,17 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -527,7 +652,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -536,7 +661,7 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
         "ScaleUnit": "100%"
     },
     {
@@ -544,17 +669,17 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / UOPS_RETIRED.RETIRE_SLOTS",
+        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xFC@ / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -563,8 +688,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -572,8 +697,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -581,7 +706,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -592,50 +717,50 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_thread_slots",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
-        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_thread_clks",
+        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -645,7 +770,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
     },
@@ -660,8 +785,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -669,7 +794,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -677,108 +802,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_hit_= latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_laten= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -807,7 +834,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -832,14 +859,14 @@ }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) / (2 * tma_info_core_core_clks= )", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@) / (2 * tma_info_core_core_clks= )", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -852,20 +879,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -895,7 +922,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -910,11 +943,11 @@ { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -922,7 +955,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -930,7 +963,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -938,7 +971,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -946,7 +979,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -954,7 +987,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -994,7 +1027,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1004,7 +1037,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1051,7 +1084,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1069,7 +1102,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1111,13 +1144,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1136,7 +1169,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -1192,7 +1225,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1213,18 +1246,18 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - 
"MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1242,29 +1275,29 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. 
Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1274,13 +1307,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1298,24 +1332,37 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_= core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_cor= e_clks)", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1323,7 +1370,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1331,7 +1378,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1345,6 +1392,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1353,12 +1407,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1367,14 +1421,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1400,43 +1455,52 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1444,17 +1508,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", + "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1462,18 +1526,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1491,7 +1555,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1499,24 +1563,48 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(80 * 
tma_info_system_core_frequency - 20.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1528,16 +1616,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1545,8 +1633,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1557,11 +1645,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1575,7 +1663,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). 
These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1583,8 +1671,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1597,12 +1685,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1610,8 +1698,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1620,7 +1708,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1628,7 +1716,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1642,19 +1730,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1708,7 +1796,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1717,7 +1805,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1734,8 +1822,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1743,8 +1831,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1752,7 +1840,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1761,35 +1849,35 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 = * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) *= MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_= LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tm= a_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 += MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_= clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1807,7 +1895,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1817,8 +1905,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1826,17 +1914,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1844,8 +1932,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. 
Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1853,18 +1941,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1880,7 +1968,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1888,7 +1976,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1896,7 +2008,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", 
"MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1905,8 +2017,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json b/to= ols/perf/pmu-events/arch/x86/skylakex/uncore-cache.json index 4fc818626491..da46a3aeb58c 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json @@ -4454,7 +4454,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Inserts : CRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4465,7 +4465,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Inserts : DRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. 
Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4476,7 +4476,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4486,7 +4486,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4496,7 +4496,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores = that hit the LLC : Counts the number of entries successfully inserted into = the TOR that match qualifications specified by the subevent. Does not inc= lude addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4507,7 +4507,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4528,7 +4528,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Inserts : CRds issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4539,7 +4539,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Inserts : DRds issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4550,7 +4550,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4560,7 +4560,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4570,7 +4570,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores = that missed the LLC : Counts the number of entries successfully inserted in= to the TOR that match qualifications specified by the subevent. 
Does not = include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4581,7 +4581,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4624,7 +4624,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM", "Experimental": "1", - "Filter": "config1=3D0x4903300000000", + "Filter": "config1=3D0x49033", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO ItoM requests that mis= s the LLC. An ItoM request is used by IIO to request a data write without f= irst reading the data for ownership.", "UMask": "0x24", @@ -4636,7 +4636,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RDCUR", "Experimental": "1", - "Filter": "config1=3D0x43c3300000000", + "Filter": "config1=3D0x43C33", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO RdCur requests and mis= s the LLC. A RdCur request is used by IIO to read data without changing sta= te.", "UMask": "0x24", @@ -4648,7 +4648,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RFO", "Experimental": "1", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO RFO requests that miss= the LLC. A read for ownership (RFO) requests a cache line to be cached in = E state with the intent to modify.", "UMask": "0x24", @@ -4865,7 +4865,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4876,7 +4876,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. 
= Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4887,7 +4887,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4897,7 +4897,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4907,7 +4907,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that hit the LLC : For each cycle, this event accumulates the number of v= alid entries in the TOR that match qualifications specified by the subevent= . Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4918,7 +4918,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4939,7 +4939,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4950,7 +4950,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4961,7 +4961,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4971,7 +4971,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4981,7 +4981,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that missed the LLC : For each cycle, this event accumulates the number o= f valid entries in the TOR that match qualifications specified by the subev= ent. 
Does not include addressless requests such as locks and interrupts= .", "UMask": "0x21", @@ -4992,7 +4992,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -5037,7 +5037,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_ITOM", "Experimental": "1", - "Filter": "config1=3D0x4903300000000", + "Filter": "config1=3D0x49033", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO ItoM req= uests that miss the LLC. An ItoM is used by IIO to request a data write wit= hout first reading the data for ownership.", "UMask": "0x24", @@ -5049,7 +5049,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR", "Experimental": "1", - "Filter": "config1=3D0x43c3300000000", + "Filter": "config1=3D0x43C33", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO RdCur re= quests that miss the LLC. A RdCur request is used by IIO to read data witho= ut changing state.", "UMask": "0x24", @@ -5061,7 +5061,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RFO", "Experimental": "1", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO RFO requ= ests that miss the LLC. A read for ownership (RFO) requests data to be cach= ed in E state with the intent to modify.", "UMask": "0x24", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.js= on b/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json index 216a00237cd1..eef801c68bd2 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json @@ -13754,16 +13754,5 @@ "PerPkg": "1", "PublicDescription": "Number outstanding register requests within = message channel tracker", "Unit": "UBOX" - }, - { - "BriefDescription": "UPI interconnect send bandwidth for payload. 
Derived from unc_upi_txl_flits.all_data", - "Counter": "0,1,2,3", - "EventCode": "0x2", - "EventName": "UPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.", - "ScaleUnit": "7.11E-06Bytes", - "UMask": "0xf", - "Unit": "UPI" } ] -- 2.47.1.545.g3c1d2e2a6a-goog From nobody Fri Dec 27 09:48:59 2024
Date: Mon, 9 Dec 2024 14:27:59 -0800 In-Reply-To: <20241209222800.296000-1-irogers@google.com> Message-Id: <20241209222800.296000-23-irogers@google.com> References: <20241209222800.296000-1-irogers@google.com> Subject: [PATCH v1 22/22] perf vendor events: Update Tigerlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Update events from v1.16 to v1.17. Update TMA metrics from 4.8 to 5.01.
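As a quick smoke test (illustrative only; the event names come from the cache.json hunk below), the L2_LINES_OUT.USELESS_HWPF event added here can be exercised once perf is rebuilt with the updated JSON, for example:

  # Confirm the generated pmu-events tables picked up the new event:
  $ perf list | grep -i useless_hwpf
  # Count unused-prefetch vs. silently dropped L2 evictions, system-wide for one second:
  $ perf stat -e L2_LINES_OUT.USELESS_HWPF,L2_LINES_OUT.SILENT -a sleep 1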
Bring in the event updates v1.17: https://github.com/intel/perfmon/commit/e1d5ac3412450bf049301cb26206d03c41066b83 The TMA 5.01 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-authored-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../pmu-events/arch/x86/tigerlake/cache.json | 45 +- .../arch/x86/tigerlake/frontend.json | 17 - .../pmu-events/arch/x86/tigerlake/memory.json | 13 +- .../arch/x86/tigerlake/metricgroups.json | 10 +- .../arch/x86/tigerlake/pipeline.json | 30 +- .../arch/x86/tigerlake/tgl-metrics.json | 745 +++++++++++------- .../x86/tigerlake/uncore-interconnect.json | 4 +- .../arch/x86/tigerlake/uncore-other.json | 2 +- .../arch/x86/tigerlake/virtual-memory.json | 18 + 10 files changed, 513 insertions(+), 373 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index e41bccc625af..3b15bc6b1654 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -33,7 +33,7 @@ GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core GenuineIntel-6-55-[01234],v1.36,skylakex,core GenuineIntel-6-86,v1.23,snowridgex,core -GenuineIntel-6-8[CD],v1.16,tigerlake,core +GenuineIntel-6-8[CD],v1.17,tigerlake,core GenuineIntel-6-2C,v5,westmereep-dp,core GenuineIntel-6-25,v4,westmereep-sp,core GenuineIntel-6-2F,v4,westmereex,core diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/cache.json b/tools/perf/pmu-events/arch/x86/tigerlake/cache.json index f4144a1110be..d316eefee857 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/cache.json @@ -75,14 +75,23 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0xf2", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" }, + { + "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses", + "Counter": "0,1,2,3", + "EventCode": "0xf2", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of cache lines that have been prefetched by the L2 hardware prefetcher but not used by demand access when evicted from the L2 cache", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "L2 code requests", "Counter": "0,1,2,3", @@ -233,7 +242,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions.
This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" }, @@ -244,7 +252,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" }, @@ -255,7 +262,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loads and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" }, @@ -266,7 +272,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked access.", "SampleAfterValue": "100007", "UMask": "0x21" }, @@ -277,7 +282,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" }, @@ -288,7 +292,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" }, @@ -299,7 +302,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" }, @@ -310,7 +312,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" }, @@ -321,7 +322,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions where a cross-core snoop hit in another cores caches on this socket, the data was forwarded back to the requesting core as the data was modified (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD).", "SampleAfterValue": "20011", "UMask": "0x4" }, @@ -332,7 +332,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" }, @@ -343,7 +342,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" }, @@ -354,7 +352,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions in which the L3 supplied the data and a cross-core snoop hit in another cores caches on this socket but that other core did not forward the data back (SNOOP_HIT_NO_FWD).", "SampleAfterValue": "20011", "UMask": "0x2" }, @@ -365,7 +362,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load
to uncacheable memory-type, or at least one cache-line split locked access", "SampleAfterValue": "100007", "UMask": "0x4" }, @@ -376,7 +372,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" }, @@ -387,7 +382,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -398,7 +392,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" }, @@ -409,7 +402,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" }, @@ -420,7 +412,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" }, @@ -431,7 +422,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" }, @@ -442,7 +432,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" }, @@ -458,7 +447,7 @@ "UMask": "0x1" }, { - "BriefDescription": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "BriefDescription": "Counts demand data reads that hit a cacheline in the L3 where a snoop hit in another cores caches which forwarded the data to the requesting core.", "Counter": "0,1,2,3", "EventCode": "0xB7, 0xBB", "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", @@ -532,6 +521,16 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, + { + "BriefDescription": "Cycles with offcore outstanding Code Reads transactions in the SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x60", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD", + "PublicDescription": "Counts the number of offcore outstanding Code Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation).
See the corresponding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles when offcore outstanding Demand Data Read transactions are present in SuperQueue (SQ), queue to uncore", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json b/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json index 13c052d0f470..6ebdf6c5ce11 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json @@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/memory.json b/tools/perf/pmu-events/arch/x86/tigerlake/memory.json index a125cefa100f..f99018cab7e7 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/memory.json @@ -25,7 +25,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" }, @@ -38,7 +37,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" }, @@ -51,7 +49,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" }, @@ -64,7 +61,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -77,7 +73,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" }, @@ -90,7 +85,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" }, @@ -103,7 +97,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" }, @@ -116,7 +109,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.
Reported latency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" }, @@ -135,17 +127,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was triggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json index 3a88260194d1..80ca8021f2de 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json @@ -37,6 +37,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet", @@ -83,7 +84,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category", "tma_core_bound_group": "Metrics contributing to tma_core_bound category", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category", @@ -93,6 +96,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", "tma_issueBW": "Metrics related by the issue $issueBW", @@ -112,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category", "tma_l1_bound_group": "Metrics
contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_operations category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category", @@ -128,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound category", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category" } diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json index 09b53b0722a9..7ef1bac08463 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json @@ -9,6 +9,15 @@ "SampleAfterValue": "1000003", "UMask": "0x9" }, + { + "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0x14", + "EventName": "ARITH.FP_DIVIDER_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.", "Counter": "0,1,2,3,4,5,6,7", @@ -23,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -32,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" }, @@ -42,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x10" }, @@ -52,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" }, @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" }, @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions retired excluding returns.
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" }, @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call instructions retired.", "SampleAfterValue": "100007", "UMask": "0x2" }, @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" }, @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" }, @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instructions retired.", "SampleAfterValue": "50021", "UMask": "0x11" }, @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.", "SampleAfterValue": "50021", "UMask": "0x10" }, @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" }, @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).", "SampleAfterValue": "50021", "UMask": "0x80" }, @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" }, @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" }, @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.", "SampleAfterValue": "50021", "UMask": "0x8" }, @@ -396,7 +388,6 @@ "BriefDescription": "Number of instructions retired.
Fixed Counter - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -406,7 +397,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -415,7 +405,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions", "SampleAfterValue": "2000003", "UMask": "0x2" }, { "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json index c45c6b4a380d..8c0cd6e63a2a 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue.
However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,11 +129,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category.
Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks.
Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation.
Covers Core Bound when High ILP as well as when long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments).
Related metrics: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks.
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, + { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_branch_instructions", @@ -147,7 +240,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES.
Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -155,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -164,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example.
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -173,18 +266,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= 
_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(49 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))= ) + 48 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS= ) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info= _thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((54 * tma_info_system_core_frequency - 5 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (53 * tma_info_system_core_frequ= ency - 5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. 
Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -194,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "48 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT /= MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(53 * tma_info_system_core_frequency - 5 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. 
Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -221,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -231,8 +372,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -241,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -249,44 +390,44 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "54 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -296,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -306,16 +447,16 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. 
This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -324,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -333,7 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -341,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -350,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -359,8 +508,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -368,8 +517,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -377,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -389,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -407,41 +556,41 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -455,7 +604,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -470,8 +619,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) 
hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -479,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -488,108 +637,10 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_= full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_= fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -650,11 +701,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)." + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -667,20 +718,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache misses", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -716,7 +767,13 @@ "MetricName": "tma_info_frontend_lsd_coverage" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" }, @@ -734,7 +791,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", @@ -742,7 +799,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). 
Values < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", @@ -750,7 +807,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", @@ -758,7 +815,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)", @@ -766,7 +823,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)", @@ -774,7 +831,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). 
Values < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)", @@ -819,7 +876,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" }, @@ -829,7 +886,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -864,7 +921,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -882,7 +939,7 @@ { "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -924,13 +981,13 @@ { "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -949,21 +1006,15 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Loads", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Loads", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,umask\\=0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": 
"tma_info_memory_latency_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", @@ -989,6 +1040,13 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", @@ -1015,8 +1073,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1043,18 +1101,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1072,14 +1130,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (arb@event\\=3D0x81\\,umask\\=3D0x1@ + arb@eve= nt\\=3D0x84\\,umask\\=3D0x1@) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1089,13 +1147,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1118,12 +1177,25 @@ "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. 
This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1131,7 +1203,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1139,7 +1211,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1153,6 +1225,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1160,7 +1239,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1169,14 +1248,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1186,13 +1266,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1208,33 +1288,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. 
These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1243,8 +1331,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)", + "MetricExpr": "5 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" }, { @@ -1253,17 +1350,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)", - "MetricExpr": "17.5 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "(22.5 * tma_info_system_core_frequency - 5 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. 
Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -1271,18 +1368,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { @@ -1299,7 +1396,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1307,16 +1404,40 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + 
"MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1326,7 +1447,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%" }, { @@ -1336,16 +1457,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. 
These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1353,8 +1474,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { @@ -1364,11 +1485,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. 
This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).", + "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", @@ -1382,7 +1503,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1390,8 +1511,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1405,19 +1526,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=0x4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=0x1@ / tma_info_core_core_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1425,8 +1554,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. 
The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1434,7 +1563,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1449,19 +1578,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).", + "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.", + "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1497,7 +1626,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1505,8 +1634,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). 
Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1514,8 +1643,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1523,7 +1652,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { @@ -1532,7 +1661,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { @@ -1541,14 +1670,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1", @@ -1561,7 +1690,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1570,7 +1699,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1579,8 +1708,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1589,17 +1718,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1607,8 +1736,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1617,17 +1746,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1644,7 +1773,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1652,7 +1781,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1660,7 +1813,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1669,7 +1822,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1678,8 +1831,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.j= son b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json index 1500bf109c99..3732b6a68fe8 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_ARB_COH_TRK_REQUESTS.ALL", + "BriefDescription": "Number of entries allocated. Account for Any = type: e.g. Snoop, etc.", "Counter": "0,1", "EventCode": "0x84", "EventName": "UNC_ARB_COH_TRK_REQUESTS.ALL", @@ -90,7 +90,7 @@ "Unit": "ARB" }, { - "BriefDescription": "UNC_ARB_TRK_REQUESTS.ALL", + "BriefDescription": "Total number of all outgoing entries allocate= d. Accounts for Coherent and non-coherent traffic.", "Counter": "0,1", "EventCode": "0x81", "EventName": "UNC_ARB_TRK_REQUESTS.ALL", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json b/t= ools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json index 62dc0fc76748..76eeca979e0b 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", "Counter": "0,1,2,3", @@ -82,6 +91,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts page walks completed due to demand da= ta stores whose address translations missed in the TLB and were mapped to 1= G pages. The page walks can end with or without a page fault.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", --=20 2.47.1.545.g3c1d2e2a6a-goog
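
P.S. A note on the recurring "MetricThreshold" change above: most hunks only drop
now-redundant parentheses, e.g. "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 &
tma_frontend_bound > 0.15)" becoming "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 &
tma_frontend_bound > 0.15". This should be behavior-preserving provided '&' binds less
tightly than the comparison operators (my reading of perf's metric expression grammar in
tools/perf/util/expr.y) and is associative. A minimal, untested Python sketch of that
equivalence, using made-up metric values rather than real counter data:

    # Hypothetical illustration only; not part of the patch or of perf's
    # expression evaluator. Each argument stands in for an already-computed
    # TMA metric value.
    def threshold_fires(tma_ms_switches, tma_fetch_latency, tma_frontend_bound):
        # Old form, with the explicit grouping removed by this update:
        #   tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)
        old = (tma_ms_switches > 0.05) and ((tma_fetch_latency > 0.1) and (tma_frontend_bound > 0.15))
        # New form, written without parentheses:
        #   tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15
        new = (tma_ms_switches > 0.05) and (tma_fetch_latency > 0.1) and (tma_frontend_bound > 0.15)
        return old, new

    # Logical AND is associative, so the two forms agree on every input.
    for sample in ((0.06, 0.2, 0.2), (0.06, 0.05, 0.2), (0.01, 0.2, 0.2)):
        old, new = threshold_fires(*sample)
        assert old == new

The same argument applies to every de-parenthesized threshold in this patch, since they
are all pure conjunctions of comparisons.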