Date: Fri, 21 Mar 2025 23:33:29 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-2-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 01/35] perf vendor events: Update alderlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.28 to v1.29. Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- .../arch/x86/alderlake/adl-metrics.json | 485 +++++++++--------- .../pmu-events/arch/x86/alderlake/cache.json | 77 +++ .../pmu-events/arch/x86/alderlake/memory.json | 55 ++ .../pmu-events/arch/x86/alderlake/other.json | 196 ------- .../arch/x86/alderlake/pipeline.json | 67 ++- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 6 files changed, 441 insertions(+), 441 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/to= ols/perf/pmu-events/arch/x86/alderlake/adl-metrics.json index 147379cae37b..2b88590e3756 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json @@ -103,7 +103,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "MetricThreshold": 
"tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -113,7 +113,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound >0.10)", + "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%", @@ -125,7 +125,7 @@ "MetricExpr": "(5 * cpu_atom@CPU_CLK_UNHALTED.CORE@ - (cpu_atom@TO= PDOWN_FE_BOUND.ALL@ + cpu_atom@TOPDOWN_BE_BOUND.ALL@ + cpu_atom@TOPDOWN_RET= IRING.ALL@)) / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation >0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", "ScaleUnit": "100%", @@ -136,7 +136,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -146,7 +146,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (5 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -156,7 +156,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (5 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": 
"cpu_atom" }, @@ -165,7 +165,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (5 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -174,7 +174,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -184,7 +184,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (5 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -193,7 +193,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (5 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -203,7 +203,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", "MetricGroup": 
"Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound >0.20)", + "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -213,7 +213,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (5 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -222,7 +222,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -232,7 +232,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (5 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -567,7 +567,7 @@ "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@ / cpu_atom@CPU_CL= K_UNHALTED.CORE@", "MetricName": "tma_info_system_mux", - "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))", + "MetricThreshold": 
"tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", "Unit": "cpu_atom" }, { @@ -606,7 +606,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB@ / (5 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -615,7 +615,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / = (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", + "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -625,7 +625,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -634,7 +634,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (5 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" 
}, @@ -643,7 +643,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (5 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -652,7 +652,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (5 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -661,7 +661,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (5 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -670,7 +670,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (5 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -679,7 +679,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (5 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": 
"TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -688,7 +688,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -699,7 +699,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "(tma_retiring >0.75)", + "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -709,7 +709,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -721,7 +721,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_D= ISPATCHED.PORT_1@ + 
cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DIS= PATCHED.PORT_6@) / (5 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -734,13 +734,13 @@ "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_threa= d_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -751,7 +751,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -768,13 +768,13 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20", @@ -791,7 +791,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound 
+ tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -800,7 +800,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + 
tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependen= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_= bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma= _l3_bound + tma_store_bound)) * (tma_lock_latency / 
(tma_dtlb_load + tma_fb= _full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tm= a_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_= latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_s= tore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_mem= ory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store= + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming= _stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -809,16 +809,16 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", 
"MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: ", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_= clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_br= anch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_u= nknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_reste= ers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth *= tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_co= de", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * 
tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * t= ma_ms / (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code= ", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", @@ -826,7 +826,7 @@ }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_s= witches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_rest= eers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_r= esteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses= + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_= dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ls= d + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / = (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_mic= rocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + 
tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite = + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_bra= nch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_n= ukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cpu_= core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / (t= ma_divider + tma_ports_utilization + tma_serializing_operation) + tma_micro= code_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (t= ma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -835,7 +835,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound= + tma_l1_bound + tma_l2_bound 
+ tma_l3_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_sto= re_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -844,16 +844,16 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across 
processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -866,11 +866,11 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", "Unit": "cpu_core" }, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_i= nstructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -879,7 +879,7 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@top= down\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-re= tiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -893,26 +893,26 @@ "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_= thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric 
represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -921,7 +921,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": 
"tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -931,26 +931,26 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_D= EMAND_CODE_RD@ / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -959,7 +959,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": 
"tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -968,35 +968,35 @@ "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * cpu_core@ITLB_MISSES.WALK_COMP= LETED_2M_4M@ / (cpu_core@ITLB_MISSES.WALK_COMPLETED_4K@ + cpu_core@ITLB_MIS= SES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB 
pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * cpu_core@ITLB_MISSES.WALK_COMP= LETED_4K@ / (cpu_core@ITLB_MISSES.WALK_COMPLETED_4K@ + cpu_core@ITLB_MISSES= .WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((28 * tma_info_system_core_frequency - 3 * tma_inf= o_system_core_frequency) * (cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (c= pu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_R= D.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FW= D@))) + (27 * tma_info_system_core_frequency - 3 * tma_info_system_core_fre= quency) * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@) * (1 + cpu_core@MEM_= LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_t= hread_clks", + "MetricExpr": "(25 * tma_info_system_core_frequency * (cpu_core@ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) + 24 * tma_info_system_core_frequ= ency * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@) * (1 + cpu_core@MEM_LOA= D_RETIRED.FB_HIT@ / 
cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thre= ad_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1007,26 +1007,26 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "(27 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ + = cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (1 - cpu_core@OCR.DEMAND_DATA_= RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + c= pu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@ME= M_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info= _thread_clks", + "MetricExpr": "24 * tma_info_system_core_frequency * (cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_F= WD@ * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_L= OAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - = cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks = / 2", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cp= u_core@INST_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1036,7 +1036,7 @@ "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1046,7 +1046,7 @@ "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info= _thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1057,7 +1057,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1066,28 +1066,28 @@ "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D0x1@ + cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@, max(cpu_core@CYCLE_ACTIV= ITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_= info_thread_clks", + "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D1@ + cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@, max(cpu_core@CYCLE_ACTIVIT= Y.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_in= fo_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. 
Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D0x1@ + cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@) / tma_info_core_core_clk= s", + "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D1@ + cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. 
Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1096,8 +1096,8 @@ "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1118,18 +1118,18 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - cpu_core@I= NT_MISC.UOP_DROPPING@ / tma_info_thread_slots", + "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1149,7 +1149,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1159,16 +1159,16 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "cpu_core@ARITH.FPDIV_ACTIVE@ / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1177,8 +1177,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1187,8 +1187,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1197,8 +1197,8 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1207,41 +1207,41 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - cpu_core@IN= T_MISC.UOP_DROPPING@ / tma_info_thread_slots", + "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1250,8 +1250,8 @@ "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1264,7 +1264,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", @@ -1272,7 +1272,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / 
cpu_core@BR_MISP_RETIR= ED.COND_TAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", @@ -1280,15 +1280,15 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -1320,7 +1320,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -1329,7 +1329,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - 
"MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1338,10 +1338,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: ", "Unit": "cpu_core" }, { @@ -1412,12 +1413,12 @@ "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" @@ -1432,22 +1433,22 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", "MetricGroup": "Fed;FetchBW", "MetricName": 
"tma_info_frontend_fetch_upc", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency", "Unit": "cpu_core" @@ -1497,14 +1498,14 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", @@ -1524,7 +1525,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", "Unit": "cpu_core" }, { @@ -1533,7 +1534,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1542,7 +1543,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1551,7 +1552,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1560,7 +1561,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1623,7 +1624,7 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 13", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1769,7 +1770,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" @@ -1849,7 +1850,7 @@ }, { "BriefDescription": "", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D0x1@)", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute", "Unit": "cpu_core" @@ -1880,20 +1881,20 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1946,23 +1947,22 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.FAR_BR= ANCH:u", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", "MetricGroup": "OS", "MetricName": 
"tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / cpu_core@CPU_CLK_UNHA= LTED.THREAD@", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -2030,7 +2030,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", @@ -2041,7 +2041,6 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { @@ -2049,7 +2048,7 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", "Unit": "cpu_core" }, { @@ -2061,14 +2060,14 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization", "Unit": "cpu_core" @@ -2086,15 +2085,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "MetricThreshold": "tma_info_thread_uptb < 9", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2104,7 +2103,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric 
represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2113,8 +2112,8 @@ "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_V= EC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2123,8 +2122,8 @@ "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_V= EC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring *= tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_por= t_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2133,8 +2132,8 @@ "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2143,7 +2142,7 @@ "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. 
Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2153,7 +2152,7 @@ "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu= _core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 /= 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVIT= Y.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2163,7 +2162,7 @@ "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2173,7 +2172,7 @@ "MetricExpr": "3 * tma_info_system_core_frequency * cpu_core@MEM_L= OAD_RETIRED.L2_HIT@ * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM= _LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2183,18 +2182,18 @@ "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(12 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2)) / tma= _info_thread_clks", + "MetricExpr": "9 * tma_info_system_core_frequency * (cpu_core@MEM_= LOAD_RETIRED.L3_HIT@ * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@ME= M_LOAD_RETIRED.L1_MISS@ / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2203,19 +2202,19 @@ "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2234,7 +2233,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2243,34 +2242,34 @@ "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_1G@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_2M_4M@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_C= OMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss 
* cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_4K@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2279,7 +2278,7 @@ "MetricExpr": "(16 * max(0, cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ = - cpu_core@L2_RQSTS.ALL_RFO@) + cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu= _core@MEM_INST_RETIRED.ALL_STORES@ * (10 * cpu_core@L2_RQSTS.RFO_HIT@ + min= (cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.C= YCLES_WITH_DEMAND_RFO@))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2290,7 +2289,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2301,16 +2300,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_c= lks", + "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clk= s", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2320,34 +2319,34 @@ "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -2370,7 +2369,7 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": 
"tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2386,18 +2385,18 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "max(cpu_core@IDQ.MS_CYCLES_ANY@, cpu_core@UOPS_RETI= RED.MS\\,cmask\\=3D0x1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISS= UED.ANY@)) / tma_info_core_core_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "max(cpu_core@IDQ.MS_CYCLES_ANY@, cpu_core@UOPS_RETI= RED.MS\\,cmask\\=3D1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUE= D.ANY@)) / tma_info_core_core_clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -2406,10 +2405,10 @@ }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge= \\=3D0x1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tm= a_info_thread_clks", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due 
to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2420,7 +2419,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2429,7 +2428,7 @@ "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2445,20 +2444,20 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= 
", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2468,7 +2467,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2478,7 +2477,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2488,7 +2487,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2498,7 +2497,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2507,8 +2506,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= 3_PORTS_UTIL@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_ACTIVE@ < cpu_= core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ e= lse (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTI= VITY.2_3_PORTS_UTIL@) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2517,8 +2516,8 @@ "MetricExpr": "(cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + max(cpu= _core@RS.EMPTY_RESOURCE@ - cpu_core@RESOURCE_STALLS.SCOREBOARD@, 0)) / tma_= info_thread_clks * (cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_AC= TIVITY.BOUND_ON_LOADS@) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2527,7 +2526,7 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2538,8 +2537,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2549,7 +2548,7 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2557,7 +2556,7 @@ { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2571,7 +2570,7 @@ "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2581,8 +2580,8 @@ "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2592,7 +2591,7 @@ "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2603,7 +2602,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2612,8 +2611,8 @@ "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ / tma_info_= core_core_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2622,7 +2621,7 @@ "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.= L2_STALLS@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2632,8 +2631,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. 
Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2642,8 +2641,8 @@ "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2652,8 +2651,8 @@ "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2672,7 +2671,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2681,34 +2680,34 @@ "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_core_core_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_1G@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 &
tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_1G@)",
"MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2717,7 +2716,7 @@ "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE.
Related metrics: tma_fb_full", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2727,7 +2726,7 @@ "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2737,8 +2736,8 @@ "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_core@UOPS_EXECUTED.THREAD@", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA.
See Tip under Tuning Hint.", "ScaleUnit": "100%", "Unit": "cpu_core" } diff --git a/tools/perf/pmu-events/arch/x86/alderlake/cache.json b/tools/perf/pmu-events/arch/x86/alderlake/cache.json index a20e19738046..04c53035e967 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/cache.json @@ -1050,6 +1050,28 @@ "UMask": "0x3", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and L2 cache that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10008", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -1094,6 +1116,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription":
"Counts demand data reads that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -1160,6 +1204,28 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -1215,6 +1281,17 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x14000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/memory.json b/tools/perf/pmu-events/arch/x86/alderlake/memory.json index fa15f5797bed..c06507a40bde 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/memory.json +++
b/tools/perf/pmu-events/arch/x86/alderlake/memory.json @@ -253,6 +253,17 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -264,6 +275,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -297,6 +330,17 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership
(PREFETCHW) that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -330,6 +374,17 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784004000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/other.json b/tools/perf/pmu-events/arch/x86/alderlake/other.json index a8b23e92408c..ae3a6630bd72 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/other.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/other.json @@ -55,116 +55,6 @@ "UMask": "0x1", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and L2 cache that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.COREWB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10008", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", -
"MSRValue": "0x784000004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand data reads that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName":
"OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784000002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, { "BriefDescription": "Counts streaming stores which modify a full 64 byte cacheline that have any type of response.", "Counter": "0,1,2,3,4,5", @@ -209,92 +99,6 @@ "UMask": "0x1", "Unit": "cpu_core" }, - { - "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.SWPF_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x14000", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.SWPF_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784004000", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g.
branch mispredictions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation Station (RS) was empty. Could be useful to closely sample on front-end latency issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty due to a resource in the back-end", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY_COUNT", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "Deprecated": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.COUNT", - "Invert": "1", - "SampleAfterValue": "100003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY", - "Counter": "0,1,2,3,4,5,6,7", - "Deprecated": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.CYCLES", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.
For Tremont, UMWAIT and TPAUSE will only put the CPU into C0.1 activity state (not C0.2 activity state)", - "Counter": "0,1,2,3,4,5", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, { "BriefDescription": "Cycles the uncore cannot take further requests", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json index f5bf0816f190..f08c1d6a99ba 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json @@ -1213,8 +1213,9 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.", + "BriefDescription": "This event is deprecated.", "Counter": "0,1,2,3,4,5", + "Deprecated": "1", "EventCode": "0xc3", "EventName": "MACHINE_CLEARS.SLOW", "SampleAfterValue": "20003", @@ -1289,6 +1290,70 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation station (RS) is empty for this logical processor. This is usually caused when the front-end pipeline runs into starvation periods (e.g.
branch mispredictions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation Station (RS) was empty. Could be useful to closely sample on front-end latency issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty due to a resource in the back-end", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY_COUNT", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "Deprecated": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.COUNT", + "Invert": "1", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.CYCLES", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.
For Tremont, UMWAIT and TPAUSE will only put the CPU into C0.1 activity state (not C0.2 activity state)", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots not consumed by the backend due to a micro-sequencer (MS) scoreboard, which stalls the front-end from issuing from the UROM until a specified older uop retires.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 56d5fc419acf..881f418137fd 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,5 +1,5 @@ Family-model,Version,Filename,EventType -GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core +GenuineIntel-6-(97|9A|B7|BA|BF),v1.29,alderlake,core GenuineIntel-6-BE,v1.28,alderlaken,core GenuineIntel-6-C[56],v1.07,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:30 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-3-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 02/35] perf vendor events: Update AlderlakeN events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset="utf-8"

Update events from v1.28 to v1.29.
Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups.

Signed-off-by: Ian Rogers
---
 .../arch/x86/alderlaken/adln-metrics.json     | 54 +++++-----
 .../pmu-events/arch/x86/alderlaken/cache.json | 50 ++++++++++
 .../arch/x86/alderlaken/memory.json           | 40 ++++++++
 .../pmu-events/arch/x86/alderlaken/other.json | 98 -------------------
 .../arch/x86/alderlaken/pipeline.json         | 11 ++-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |  2 +-
 6 files changed, 128 insertions(+), 127 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json index ad04b1e3881e..ce93648043ef 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json @@ -75,7 +75,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_core_bound >0.10) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { @@ -84,7 +84,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound >0.10)", + "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count.
If a uop is not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -95,7 +95,7 @@ "MetricExpr": "(5 * CPU_CLK_UNHALTED.CORE - (TOPDOWN_FE_BOUND.ALL + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation >0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear. Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", "ScaleUnit": "100%" @@ -105,7 +105,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend.
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%" }, @@ -114,7 +114,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -123,7 +123,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -131,7 +131,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -139,7 +139,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -148,7 +148,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": 
"TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -156,7 +156,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (5 * CPU_CLK_UNH= ALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -165,7 +165,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound >0.20)", + "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -174,7 +174,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -182,7 +182,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -191,7 +191,7 @@ 
"MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (5 * CPU_CLK_UN= HALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -473,7 +473,7 @@ "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", "MetricName": "tma_info_system_mux", - "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))" + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", @@ -506,7 +506,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -514,7 +514,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (5 * CPU_C= LK_UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", + "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -523,7 +523,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & 
((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { @@ -531,7 +531,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { @@ -539,7 +539,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (5 * CPU_CLK_UNHALTE= D.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -547,7 +547,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (5 * CPU_CLK_UNHALTED.CORE= )", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -555,7 +555,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (5 * CPU_CLK_UNHALTED.= CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -563,7 +563,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (5 * CPU_CLK_UNHALTED.C= 
ORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { @@ -571,7 +571,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { @@ -579,7 +579,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -589,7 +589,7 @@ "MetricExpr": "TOPDOWN_RETIRING.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "(tma_retiring >0.75)", + "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -598,7 +598,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" } ] diff --git 
a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json b/tools/p= erf/pmu-events/arch/x86/alderlaken/cache.json index fd9ed58c2f90..605d56311dfc 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/cache.json @@ -396,6 +396,26 @@ "SampleAfterValue": "1000003", "UMask": "0x6" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10008", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -436,6 +456,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -476,6 +506,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": 
"0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -516,6 +556,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that have any type of respons= e.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x14000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 = cache.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json b/tools/= perf/pmu-events/arch/x86/alderlaken/memory.json index 3b46b048dfb2..06eca0a45c18 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/memory.json @@ -56,6 +56,16 @@ "SampleAfterValue": "20003", "UMask": "0x2" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -66,6 +76,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": 
"0x1a6,0x1a7", + "MSRValue": "0x784000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -86,6 +106,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5", @@ -106,6 +136,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0xB7", + "EventName": "OCR.SWPF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x784004000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the= L3 cache.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/other.json b/tools/p= erf/pmu-events/arch/x86/alderlaken/other.json index f8c21b7f8f40..0ebcb3e20e1d 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/other.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/other.json @@ -8,76 +8,6 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": 
"OCR.COREWB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10008", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": 
"OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", "Counter": "0,1,2,3,4,5", @@ -107,33 +37,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that have any type of respons= e.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.SWPF_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x14000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache software prefetches whic= h include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5", - "EventCode": "0xB7", - "EventName": "OCR.SWPF_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x784004000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state. 
For Tremont, UMWAIT and TPAUSE will only put the CPU into C0.1 activity state (not C0.2 activity state)", - "Counter": "0,1,2,3,4,5", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4" } ] diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json index 713ebc21cec0..0cc8cd203fc3 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json @@ -399,8 +399,9 @@ "UMask": "0x20" }, { - "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.", + "BriefDescription": "This event is deprecated.", "Counter": "0,1,2,3,4,5", + "Deprecated": "1", "EventCode": "0xc3", "EventName": "MACHINE_CLEARS.SLOW", "SampleAfterValue": "20003", @@ -423,6 +424,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.
For Tremont, UMWAIT and TPAUSE will onl= y put the CPU into C0.1 activity state (not C0.2 activity state)", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the = front-end from issuing from the UROM until a specified older uop retires.", "Counter": "0,1,2,3,4,5", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 881f418137fd..0ef31b65f8df 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,6 +1,6 @@ Family-model,Version,Filename,EventType GenuineIntel-6-(97|9A|B7|BA|BF),v1.29,alderlake,core -GenuineIntel-6-BE,v1.28,alderlaken,core +GenuineIntel-6-BE,v1.29,alderlaken,core GenuineIntel-6-C[56],v1.07,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 163E11940A1 for ; Sat, 22 Mar 2025 06:34:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625279; cv=none; b=rrTn/I1t5hybQw93CpqCWce12ppbNPF+dGK8kWayL/jGT1hGVEU/5cz/HnGoa+q9uOyRiZLOl++E3+DLulqy7DaP/q4UWEnOBpZOfyStCpav/i4wmfYxhygHmYzhv9dopkpWlurnOQs1prQduDc/C1rkES++4iSRJwRVtM6EAhs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625279; c=relaxed/simple; bh=IWYZml0fBzz7dKPN39tsE5TR1VcMDBxwhnFfWeQGVpU=; 
h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=i2FLa5Ge1uMfLCHIiBRg9dgql9y+iH24Y0OQGsaRq4gwrb1aPoaSub96S1f2UFWUHpiWsN67n/fMI9WJTtfFmJosbUXQm4ypO8pLvXeh3UnElPbui+A2+ldqPIwC40LAS3Pa36zrkPL9rkxcz4pPXOfCGUg92g/7oI4WAzhGfDw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=dDG3iZQy; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="dDG3iZQy" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6f2a2ab50f6so32986597b3.3 for ; Fri, 21 Mar 2025 23:34:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625270; x=1743230070; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=QbwabrjM6SX9kwRVlAmvWlAAAtvcyG0J/geVua6EfBY=; b=dDG3iZQydZWrWoVudVNJEv4uiOjDYEu2F74k9Br4yncyTG9h7MgSpnlW8CmB15iFpH AUJK3CSnMDEOMj6I029Rpzb6WuLZfDptu0xh2xDi6WFfRK9JhZYG+7yh/jS9uA1qb9Jn dwEteVNifgV965A6NOrw66DEiT36HFsngPxsFNyLskHkaFZ25WIE8/Ku5o4wqsnTq7OT J61NUy9Vd2SjTyoAZ6G5G5nlzF9is8H4ST2CLN7rkouxabV2L5cURS1CNLLHLNXveQ0L ZrvdpqY0h6WLaKCZSR0CJUjtR1fjgSICktnJjHjOADw0ywT7PDtvT6Fax3PeRd97S1yE DriA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625270; x=1743230070; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; 
bh=QbwabrjM6SX9kwRVlAmvWlAAAtvcyG0J/geVua6EfBY=; b=ia7LjfVy9ECB3l2exCFV3YzeKjL+g4z4GtzaPftIsG5xLPmc2U3ykyn1B9XCi037rH CrQ2Bkeu9tLw7raxVReKQ7AMk/vB3NHEQJRRRwSh8iaOz8GJvHM6DLSMvbQ1+qiVTTCm RGMOJGknsmgNTlYXwEijmc1bGMvz12TQ9+ty0WxKcwirPO8tjK5eCubLYmv3jFn3nJ1M qsVKD7pnrnJparMQZf6laAlqlueFpOnN4/2pyUa9nqCTp6nCbODL4O1xSq7+FlzILLrV B1yVEyWFYPgamqS7Ek1NlspYeR9spBAlyLkj2GM7K/vMT1agVx26x7C8HuC4z4+mcfQa 1CZQ== X-Forwarded-Encrypted: i=1; AJvYcCXA7Ud3crWQttyF9OmmbbdNqjwIsF/Cgq2dL5yatm8MtktrllzjMzn6Zwx5mV2ArrIUPY+5xqISsBXwit0=@vger.kernel.org X-Gm-Message-State: AOJu0Yy3GuTiMTre3LFnRmSncOcxvug99HEV/ZjXtBgHx6OA9SoUxGtN GXcTgqBCp4LOksQ7w9kU0C1xTDvrn/lBoyu1+Bm0X2KPyrU41GD5+zCBLKX9aGB7PE1H3plu8mu OXI5P8w== X-Google-Smtp-Source: AGHT+IGleWC7KNh74hALvrM8rjadJHyQ85vtSzv8Iysf76UjQtOxF/uzTSVGhGHJ9jOteHQmsffH6fdPN0yY X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:690c:508d:b0:6fe:f270:fcaa with SMTP id 00721157ae682-700bad159e8mr26517b3.5.1742625269867; Fri, 21 Mar 2025 23:34:29 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:31 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-4-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 03/35] perf vendor events: Update arrowlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; 
charset="utf-8" Update events from v1.07 to v1.08. Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- .../arch/x86/arrowlake/arl-metrics.json | 566 +++++++++--------- .../pmu-events/arch/x86/arrowlake/cache.json | 148 +++++ .../pmu-events/arch/x86/arrowlake/memory.json | 11 + .../pmu-events/arch/x86/arrowlake/other.json | 193 ------ .../arch/x86/arrowlake/pipeline.json | 163 ++++- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 6 files changed, 608 insertions(+), 475 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json b/to= ols/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json index 7ddb89dd1871..250577dde190 100644 --- a/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json @@ -75,7 +75,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -85,7 +85,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound >0.10)", + "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. 
If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%", @@ -97,7 +97,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL_P@ / (8 * cpu_= atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation >0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", "ScaleUnit": "100%", @@ -108,7 +108,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -118,7 +118,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -128,7 +128,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -137,7 +137,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (8 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -146,7 +146,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", 
"ScaleUnit": "100%", "Unit": "cpu_atom" @@ -156,7 +156,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -165,7 +165,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -175,7 +175,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (8 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound >0.20)", + "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -185,7 +185,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -194,7 +194,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -204,7 +204,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -590,7 +590,7 @@ "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@ / cpu_atom@CPU_CL= K_UNHALTED.CORE@", "MetricName": "tma_info_system_mux", - "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", "Unit": "cpu_atom" }, { @@ -629,7 +629,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / (8 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -638,7 +638,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / = (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", + 
"MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -648,7 +648,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -657,7 +657,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -666,7 +666,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -675,7 +675,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & 
tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -684,7 +684,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -693,7 +693,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -702,7 +702,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -711,7 +711,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -722,7 +722,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / (8 * cpu_atom@CPU_= 
CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "(tma_retiring >0.75)", + "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -732,7 +732,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -744,7 +744,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "cpu_core@UOPS_DISPATCHED.ALU@ / (6 * tma_info_threa= d_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -757,13 +757,13 @@ "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. 
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_threa= d_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -774,7 +774,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -786,18 +786,18 @@ { "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + 
topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-bad\\-spec@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20", @@ -814,11 +814,11 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + 
tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_= loads + tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { @@ -826,22 +826,22 @@ "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity= + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_= bound + tma_l2_bound + tma_l3_bound + 
tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency = + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bou= nd * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_fb_ful= l + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency = + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bou= nd / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_= split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_boun= d * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_f= alse_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)= ))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", - "MetricThreshold": "(tma_bottleneck_cache_memory_latency > 20)", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. 
Related metrics: tma_l3_hit_latency, tma_= mem_latency", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_= clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_br= anch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_u= nknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_reste= ers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth *= tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_co= de", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * t= ma_ms / (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code= ", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": 
"tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", @@ -849,7 +849,7 @@ }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_s= witches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_rest= eers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_r= esteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses= + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_= dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ls= d + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / = (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_mic= rocode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite = + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_bra= nch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_n= ukes / tma_other_nukes + tma_core_bound * 
(tma_serializing_operation + cpu_= core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / (t= ma_divider + tma_ports_utilization + tma_serializing_operation) + tma_micro= code_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (t= ma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -861,7 +861,7 @@ "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fw= d_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bo= und + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency += tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", - "MetricThreshold": "(tma_bottleneck_memory_data_tlbs > 20)", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). 
Related metrics: tma_dtlb_load,= tma_dtlb_store", "Unit": "cpu_core" }, @@ -870,13 +870,13 @@ "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", - "MetricThreshold": "(tma_bottleneck_memory_synchronization > 10)", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -888,12 +888,12 @@ "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", - "MetricThreshold": "(tma_bottleneck_other_bottlenecks > 20)", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", "Unit": "cpu_core" }, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_= sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -902,7 +902,7 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@top= down\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-re= tiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -916,26 +916,26 @@ "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_= thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric 
represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -944,8 +944,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": 
"tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -954,99 +954,99 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@frontend_retired.l1i_miss@R / tma_info_thread_clks - tma_code_l2_miss)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@front= end_retired.l2_miss@R / tma_info_thread_clks", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@FRONT= END_RETIRED.L2_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & 
(tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@frontend_retired.itlb_miss@R / tma_info_thread_clks - tma_code_stlb_mis= s)", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_mis= s)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@fro= ntend_retired.stlb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@FRO= NTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging 
structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@ / (cpu_core@ITLB_MISSES.= WALK_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_4K@ / (cpu_core@ITLB_MISSES.WAL= K_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound 
> 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@br_misp_retired.cond_ntaken_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_nt_mispredicts", - "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@ * cpu= _core@br_misp_retired.cond_taken_bwd_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@ * cpu= _core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_bwd_mispredicts", - "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & (tma_bran= ch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", 
"ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@ * cpu= _core@br_misp_retired.cond_taken_fwd_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= .", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@ * cpu= _core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_fwd_mispredicts", - "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & (tma_bran= ch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ *= cpu_core@mem_load_l3_hit_retired.xsnp_miss@R, cpu_core@MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS@ * (27 * tma_info_system_core_frequency) - 3 * tma_info_syst= em_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_miss@R else= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * (27 * tma_info_system_core_f= requency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM@ * cpu_core@mem_load_l3_hit_retired.xsnp_hitm@R, cpu= _core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * (28 * tma_info_system_core_frequ= ency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit= _retired.xsnp_hitm@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * (28= * tma_info_system_core_frequency) - 3 * 
tma_info_system_core_frequency)) *= (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS= @ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_fre= quency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * min(cpu_core@MEM_LO= AD_L3_HIT_RETIRED.XSNP_HITM@R, 25 * tma_info_system_core_frequency)) * (1 += cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2= ) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. 
Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1057,17 +1057,17 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@= * cpu_core@mem_load_l3_hit_retired.xsnp_no_fwd@R, cpu_core@MEM_LOAD_L3_HIT= _RETIRED.XSNP_NO_FWD@ * (27 * tma_info_system_core_frequency) - 3 * tma_inf= o_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_no_fw= d@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * (27 * tma_info_sys= tem_core_frequency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@M= EM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * cpu_core@mem_load_l3_hit_retired.xsnp_fw= d@R, cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (28 * tma_info_system_cor= e_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load= _l3_hit_retired.xsnp_fwd@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ = * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequenc= y)) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1= _MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * mi= n(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core= _frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency)) * (1= + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ /= 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to 
data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1076,7 +1076,7 @@ "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1086,18 +1086,18 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.MEM@ / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", - "MetricExpr": "(cpu_core@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x= 1@ + cpu_core@IDQ.DSB_UOPS@ / (cpu_core@IDQ.DSB_UOPS@ + cpu_core@IDQ.MITE_U= OPS@) * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE@ - cpu_core@IDQ_BUBB= LES.FETCH_LATENCY@)) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ + = cpu_core@IDQ.DSB_UOPS@ / (cpu_core@IDQ.DSB_UOPS@ + cpu_core@IDQ.MITE_UOPS@)= * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE@ - cpu_core@IDQ_BUBBLES.F= ETCH_LATENCY@)) / tma_info_thread_clks", "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1106,28 +1106,28 @@ "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * cp= u_core@mem_inst_retired.stlb_hit_loads@R, cpu_core@MEM_INST_RETIRED.STLB_HI= T_LOADS@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_loads@R else cpu_co= re@MEM_INST_RETIRED.STLB_HIT_LOADS@ * 7) / tma_info_thread_clks + tma_load_= stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * min(cpu= _core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, 7) / tma_info_thread_clks + tma_lo= ad_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * c= pu_core@mem_inst_retired.stlb_hit_stores@R, cpu_core@MEM_INST_RETIRED.STLB_= HIT_STORES@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_stores@R else cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * 7) / tma_info_thread_clks + tma_= store_stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * min(cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, 7) / tma_info_thread_clks + tma_= store_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1136,7 +1136,7 @@ "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "(tma_false_sharing > 0.05) & ((tma_store_bound= > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. 
Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1147,7 +1147,7 @@ "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1158,18 +1158,18 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. 
In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Rel= ated metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_ipt= b, tma_lcp. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEN= D_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_= dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_miss= es, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1179,7 +1179,7 @@ "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1189,7 +1189,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1199,16 +1199,16 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "cpu_core@ARITH.FPDIV_ACTIVE@ / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1217,8 +1217,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1227,8 +1227,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1237,8 +1237,8 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1247,15 +1247,15 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", @@ -1265,23 +1265,23 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1290,26 +1290,26 @@ "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@br_misp_retired.indirect_call_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= .", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_call_mispredicts", - "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & 
(tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", - "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@br_misp_retired.indirect_cost@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@br_misp_retired.indirect_call_cost@R) / tma_info_thread= _clks, 0)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions.= ", + "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_COST@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread= _clks, 0)", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_jump_mispredicts", - "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & (tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1322,7 +1322,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", @@ -1330,29 +1330,29 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + 
"BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= .", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN_BWD@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate).= ", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN_FWD@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -1376,7 +1376,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the 
Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -1385,7 +1385,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1394,10 +1394,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": 
"Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: ", "Unit": "cpu_core" }, { @@ -1463,12 +1464,12 @@ "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.V0@ + cpu_core@FP_ARI= TH_DISPATCHED.V1@ + cpu_core@FP_ARITH_DISPATCHED.V2@ + cpu_core@FP_ARITH_DI= SPATCHED.V3@) / (4 * tma_info_thread_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" @@ -1483,15 +1484,15 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + 
"MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= frontend_retired.any_dsb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= FRONTEND_RETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", "MetricGroup": "DSBmiss;Fed;FetchLat", "MetricName": "tma_info_frontend_dsb_switches_ret", "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", @@ -1499,7 +1500,7 @@ }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" @@ -1549,7 +1550,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@fron= tend_retired.ms_flows@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@FRON= TEND_RETIRED.MS_FLOWS@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat;MicroSeq", "MetricName": "tma_info_frontend_ms_latency_ret", "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", @@ -1564,21 +1565,21 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= 
@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@frontend_retired.unknown_branch@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@FRONTEND_RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat", "MetricName": "tma_info_frontend_unknown_branches_ret", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", @@ -1598,7 +1599,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW.", "Unit": "cpu_core" }, { @@ -1607,7 +1608,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1616,7 +1617,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1625,7 +1626,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1634,7 +1635,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1697,7 +1698,7 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 17", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1708,6 +1709,13 @@ "MetricName": "tma_info_memory_fb_hpki", "Unit": "cpu_core" }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / tma_info_sys= tem_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_core" + }, { "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", "MetricExpr": "64 * cpu_core@L1D.L0_REPLACEMENT@ / 1e9 / tma_info_= system_time", @@ -1815,7 +1823,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" @@ -1873,7 +1881,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@mem_inst_retired.stlb_miss_loads@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@MEM_INST_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", @@ -1896,7 +1904,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand 
stores", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@mem_inst_retired.stlb_miss_stores@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@MEM_INST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", @@ -1935,20 +1943,20 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1993,23 +2001,22 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon 
transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.FAR_BR= ANCH:u", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / cpu_core@CPU_CLK_UNHA= LTED.THREAD@", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -2053,7 +2060,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", @@ -2064,7 +2071,6 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { @@ -2072,7 +2078,7 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", "MetricGroup": "Cor;Pipeline", "MetricName": 
"tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", "Unit": "cpu_core" }, { @@ -2084,7 +2090,7 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" @@ -2102,15 +2108,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "MetricThreshold": "tma_info_thread_uptb < 12", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2120,7 +2126,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations 
fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2129,8 +2135,8 @@ "MetricExpr": "cpu_core@INT_VEC_RETIRED.128BIT@ / (tma_retiring * = tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2139,8 +2145,8 @@ "MetricExpr": "cpu_core@INT_VEC_RETIRED.256BIT@ / (tma_retiring * = tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_= 2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2149,8 +2155,8 @@ "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2159,17 +2165,17 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L1@ / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * cpu_cor= e@mem_load_retired.l1_hit_l1@R, cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * 9) i= f 0 < cpu_core@mem_load_retired.l1_hit_l1@R else cpu_core@MEM_LOAD_RETIRED.= L1_HIT_L1@ * 9) / tma_info_thread_clks", + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache.", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * min(cpu_core= @MEM_LOAD_RETIRED.L1_HIT_L1@R, 9) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_capacity", - "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & (tma_l1_bound = > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2178,8 +2184,8 @@ "MetricExpr": "4 * cpu_core@DEPENDENT_LOADS.ANY@ / tma_info_thread= _clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: DEPENDENT_LOADS.ANY", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2188,17 +2194,17 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L2@ / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * cpu_core@m= em_load_retired.l2_hit@R, cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info= _system_core_frequency)) if 0 < cpu_core@mem_load_retired.l2_hit@R else cpu= _core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info_system_core_frequency)) * (1= + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ /= 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L2_HIT@R, 3 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2208,18 +2214,18 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L3@ / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * cpu_core@m= em_load_retired.l3_hit@R, cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 * tma_inf= o_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_c= ore@mem_load_retired.l3_hit@R else cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 = * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (= 1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ = / 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L3_HIT@R, 9 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. 
Related metrics: tma_b= ranch_resteers, tma_mem_latency, tma_store_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2228,19 +2234,19 @@ "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2250,7 +2256,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_load_op_utilization", "MetricThreshold": "tma_load_op_utilization > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. Sample with: = UOPS_DISPATCHED.LOAD", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. Sample with: = UOPS_DISPATCHED.PORT_2_3_10", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2259,7 +2265,7 @@ "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2268,54 +2274,54 @@ "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - 
"BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_1G@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_2M_4M@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_C= OMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_4K@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@me= m_inst_retired.lock_loads@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@ME= M_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & 
(tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "cpu_core@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / = tma_info_thread_clks", + "MetricExpr": "cpu@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / tma_i= nfo_thread_clks", "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2326,17 +2332,17 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_l1_bound= , tma_microcode_sequencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2345,34 +2351,34 @@ "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_l3_hit_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -2395,14 +2401,14 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= 
tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0= x1@ / tma_info_thread_clks + cpu_core@IDQ.MITE_UOPS@ / (cpu_core@IDQ.DSB_UO= PS@ + cpu_core@IDQ.MITE_UOPS@) * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.= CORE@ - cpu_core@IDQ_BUBBLES.FETCH_LATENCY@)) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ /= 2 + cpu_core@IDQ.MITE_UOPS@ / (cpu_core@IDQ.DSB_UOPS@ + cpu_core@IDQ.MITE_= UOPS@) * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE@ - cpu_core@IDQ_BUB= BLES.FETCH_LATENCY@)) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", @@ -2411,17 +2417,17 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- 
the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", "MetricExpr": "cpu_core@IDQ.MS_CYCLES_ANY@ / tma_info_thread_clks", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", @@ -2434,7 +2440,7 @@ "MetricExpr": "3 * cpu_core@IDQ.MS_SWITCHES@ / tma_info_thread_clk= s", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & 
tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2445,7 +2451,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2454,7 +2460,7 @@ "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2470,20 +2476,20 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= 
", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2493,7 +2499,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2502,8 +2508,8 @@ "MetricExpr": "((cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + (cpu_c= ore@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_3_P= ORTS_UTIL@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_ACTIVE@ < cpu_cor= e@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ else= (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVIT= Y.2_3_PORTS_UTIL@) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2512,8 +2518,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ / tma_info= _thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2522,7 +2528,7 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2533,8 +2539,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2544,24 +2550,24 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@br_mi= sp_retired.ret_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@BR_MI= SP_RETIRED.RET_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ret_mispredicts", - "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & (tma_branch_mispr= edicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This 
category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2575,8 +2581,8 @@ "MetricExpr": "(cpu_core@BE_STALLS.SCOREBOARD@ + cpu_core@CPU_CLK_= UNHALTED.C02@) / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2585,8 +2591,8 @@ "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2596,28 +2602,28 @@ "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * cpu_c= ore@mem_inst_retired.split_loads@R, cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ = * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@mem_inst_retired.= split_loads@R else cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * tma_info_memory= _load_miss_real_latency) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * min(cpu_co= re@MEM_INST_RETIRED.SPLIT_LOADS@R, tma_info_memory_load_miss_real_latency) = / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * cpu_= core@mem_inst_retired.split_stores@R, cpu_core@MEM_INST_RETIRED.SPLIT_STORE= S@) if 0 < cpu_core@mem_inst_retired.split_stores@R else cpu_core@MEM_INST_= RETIRED.SPLIT_STORES@) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * min(cpu_c= ore@MEM_INST_RETIRED.SPLIT_STORES@R, 1) / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2626,8 +2632,8 @@ "MetricExpr": "(cpu_core@XQ.FULL@ + cpu_core@L1D_MISS.L2_STALLS@) = / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2636,8 +2642,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2646,8 +2652,8 @@ "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2656,8 +2662,8 @@ "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2667,7 +2673,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", "MetricThreshold": "tma_store_op_utilization > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2676,7 +2682,7 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2685,34 +2691,34 @@ "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= thread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": 
"This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_1G@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_2M_4M@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_= core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.W= ALK_COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & 
tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_4K@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2721,7 +2727,7 @@ "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_i= nfo_thread_clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. 
Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2731,7 +2737,7 @@ "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info= _thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2741,8 +2747,8 @@ "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_co= re@UOPS_EXECUTED.THREAD@", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. 
It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%", "Unit": "cpu_core" } diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/cache.json b/tools/pe= rf/pmu-events/arch/x86/arrowlake/cache.json index f63594b2cca8..f9ba410d4b94 100644 --- a/tools/perf/pmu-events/arch/x86/arrowlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/arrowlake/cache.json @@ -8,6 +8,16 @@ "SampleAfterValue": "1000003", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of L1D cacheline (dirty) ev= ictions caused by load misses, stores, and prefetches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x51", + "EventName": "DL1.DIRTY_EVICTION", + "PublicDescription": "Counts the number of L1D cacheline (dirty) e= victions caused by load misses, stores, and prefetches. Does not count evi= ctions or dirty writebacks caused by snoops. Does not count a replacement = unless a (dirty) line was written back.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Counts the number of cache lines replaced in = L0 data cache.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -18,6 +28,16 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Cachelines replaced into the L0 and L1 d-cach= e. 
Successful replacements only (not blocked) and exclude WB-miss case", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.REPLACEMENT", + "PublicDescription": "Counts cachelines replaced into the L0 and L= 1 d-cache.", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_core" + }, { "BriefDescription": "Number of cycles a demand request has waited = due to L1D Fill Buffer (FB) unavailability.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -79,6 +99,46 @@ "UMask": "0x1f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Exclusive state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.E", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Exclusive state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Forward state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.F", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Forward state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Modified state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.M", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Modified state. 
Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Shared state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.S", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Shared state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Modified cache lines that are evicted by L2 c= ache when triggered by an L2 cache fill.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -89,6 +149,16 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = evicted due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= evicted due to an L2 cache fill. Increments on the core that brought the l= ine in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -99,6 +169,16 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = silently dropped due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= silently dropped due to an L2 cache fill. 
Increments on the core that bro= ught the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Cache lines that have been L2 hardware prefet= ched but not used by demand accesses", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -128,6 +208,15 @@ "UMask": "0xff", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that r= esulted in a Hit from a front door request only (does not include rejects o= r recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_RQSTS.MISS]", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -138,6 +227,34 @@ "UMask": "0x3f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of total L2 Cache Accesses = that resulted in a Miss from a front door request only (does not include re= jects or recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that m= iss the L2 and get BBL reject short and long rejects (includes those count= ed in L2_reject_XQ.any), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.REJECTS", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "L2 code requests", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_CODE_RD", + "PublicDescription": "Counts the total number of L2 code requests.= ", + "SampleAfterValue": "200003", + "UMask": "0xe4", + "Unit": "cpu_core" + }, { "BriefDescription": "Demand Data Read access L2 cache", "Counter": 
"0,1,2,3,4,5,6,7,8,9", @@ -408,6 +525,15 @@ "UMask": "0x78", "Unit": "cpu_lowpower" }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled to a store buffer full condition", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.SBFULL", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, { "BriefDescription": "Counts all retired load instructions.", "Counter": "0,1,2,3", @@ -1245,6 +1371,17 @@ "UMask": "0xf", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", "Counter": "0,1,2,3", @@ -1267,6 +1404,17 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/memory.json b/tools/p= erf/pmu-events/arch/x86/arrowlake/memory.json index 08f01fc66fef..f7e202dec84a 100644 --- 
a/tools/perf/pmu-events/arch/x86/arrowlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/arrowlake/memory.json @@ -332,6 +332,17 @@ "UMask": "0x4", "Unit": "cpu_lowpower" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1E780000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/other.json b/tools/pe= rf/pmu-events/arch/x86/arrowlake/other.json index 0175b2193201..f4fb51bb95ff 100644 --- a/tools/perf/pmu-events/arch/x86/arrowlake/other.json +++ b/tools/perf/pmu-events/arch/x86/arrowlake/other.json @@ -18,71 +18,6 @@ "UMask": "0x8", "Unit": "cpu_core" }, - { - "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa2", - "EventName": "BE_STALLS.SCOREBOARD", - "SampleAfterValue": "100003", - "UMask": "0x2", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Count number of times a load is depending on = another load that had just write back its data or in previous or 2 cycles = back. 
This event supports in-direct dependency through a single uop.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x02", - "EventName": "DEPENDENT_LOADS.ANY", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of uops executed on seconda= ry integer ports 0,1,2,3.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.2ND", - "SampleAfterValue": "1000003", - "UMask": "0x80", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on a load = port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.LD", - "PublicDescription": "Counts the number of uops executed on a load= port. This event counts for integer uops even if the destination is FP/ve= ctor", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 0,1, 2, 3.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.PRIMARY", - "SampleAfterValue": "1000003", - "UMask": "0x78", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on a Store= address port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.STA", - "PublicDescription": "Counts the number of uops executed on a Stor= e address port. This event counts integer uops even if the data source is F= P/vector", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.STD_JMP", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, { "BriefDescription": "This event is deprecated. 
[This event is alia= s to MISC_RETIRED.LBR_INSERTS]", "Counter": "0,1,2,3,4,5,6,7", @@ -93,75 +28,6 @@ "UMask": "0x1", "Unit": "cpu_lowpower" }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.L1", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.L2", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.L3", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.MEM", - "SampleAfterValue": "1000003", - "UMask": "0x8", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y 
DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1E780000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", "Counter": "0,1,2,3,4,5,6,7", @@ -206,65 +72,6 @@ "UMask": "0x1", "Unit": "cpu_core" }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4", - "Unit": "cpu_lowpower" - }, - { - "BriefDescription": "Counts the number of issue slots where no uop= could issue due to an IQ scoreboard that stalls allocation until a specifi= ed older uop retires or (in the case of jump scoreboard) executes. 
Commonly= executed instructions with IQ scoreboards include LFENCE and MFENCE.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.IQ_JEU_SCB", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3,4,5,6,7,8,9", diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json b/tools= /perf/pmu-events/arch/x86/arrowlake/pipeline.json index 6dbde51e7ead..739efb199668 100644 --- a/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json @@ -51,6 +51,15 @@ "UMask": "0x1f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa2", + "EventName": "BE_STALLS.SCOREBOARD", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", "Counter": "0,1,2,3,4,5,6,7", @@ -888,6 +897,15 @@ "UMask": "0x4", "Unit": "cpu_core" }, + { + "BriefDescription": "Count number of times a load is depending on = another load that had just write back its data or in previous or 2 cycles = back. 
This event supports in-direct dependency through a single uop.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x02", + "EventName": "DEPENDENT_LOADS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, { "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1139,6 +1157,53 @@ "UMask": "0x10", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of uops executed on seconda= ry integer ports 0,1,2,3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.2ND", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a load = port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.LD", + "PublicDescription": "Counts the number of uops executed on a load= port. This event counts for integer uops even if the destination is FP/ve= ctor", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 0,1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a Store= address port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STA", + "PublicDescription": "Counts the number of uops executed on a Stor= e address port. 
This event counts integer uops even if the data source is F= P/vector", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STD_JMP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "Number of vector integer instructions retired= of 128-bit vector-width.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1405,8 +1470,9 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine with the use of microcode due to SMC= , MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_= TRAP.", + "BriefDescription": "This event is deprecated.", "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", "EventCode": "0xc3", "EventName": "MACHINE_CLEARS.SLOW", "SampleAfterValue": "20003", @@ -1432,6 +1498,42 @@ "UMask": "0x1", "Unit": "cpu_lowpower" }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L1", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L2", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache 
& no load missed L3 Cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L3", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.MEM", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, { "BriefDescription": "LFENCE instructions retired", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1460,6 +1562,65 @@ "UMask": "0x1", "Unit": "cpu_lowpower" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots where no uop= could issue due to an IQ scoreboard that stalls allocation until a specifi= ed older uop retires or (in the case of jump scoreboard) executes. 
Commonly= executed instructions with IQ scoreboards include LFENCE and MFENCE.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.IQ_JEU_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the = front-end from issuing from the UROM until a specified older uop retires.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 0ef31b65f8df..1b592cf63940 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,7 +1,7 @@ Family-model,Version,Filename,EventType GenuineIntel-6-(97|9A|B7|BA|BF),v1.29,alderlake,core GenuineIntel-6-BE,v1.29,alderlaken,core -GenuineIntel-6-C[56],v1.07,arrowlake,core +GenuineIntel-6-C[56],v1.08,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core GenuineIntel-6-56,v12,broadwellde,core --=20 2.49.0.395.g12beb8f557-goog
From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:32 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-5-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 04/35] perf vendor events: Update bonnell events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Move DISPATCH_BLOCKED.ANY to the pipeline topic.
Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/bonnell/other.json | 8 -------- tools/perf/pmu-events/arch/x86/bonnell/pipeline.json | 8 ++++++++ 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/bonnell/other.json b/tools/perf= /pmu-events/arch/x86/bonnell/other.json index 3a55c101fbf7..6e6f64b96834 100644 --- a/tools/perf/pmu-events/arch/x86/bonnell/other.json +++ b/tools/perf/pmu-events/arch/x86/bonnell/other.json @@ -323,14 +323,6 @@ "SampleAfterValue": "2000000", "UMask": "0x2" }, - { - "BriefDescription": "Memory cluster signals to block micro-op disp= atch for any reason", - "Counter": "0,1", - "EventCode": "0x9", - "EventName": "DISPATCH_BLOCKED.ANY", - "SampleAfterValue": "200000", - "UMask": "0x20" - }, { "BriefDescription": "Number of Enhanced Intel SpeedStep(R) Technol= ogy (EIST) transitions", "Counter": "0,1", diff --git a/tools/perf/pmu-events/arch/x86/bonnell/pipeline.json b/tools/p= erf/pmu-events/arch/x86/bonnell/pipeline.json index 9ff032ab11e2..48d3d053a369 100644 --- a/tools/perf/pmu-events/arch/x86/bonnell/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/bonnell/pipeline.json @@ -211,6 +211,14 @@ "SampleAfterValue": "2000000", "UMask": "0x1" }, + { + "BriefDescription": "Memory cluster signals to block micro-op disp= atch for any reason", + "Counter": "0,1", + "EventCode": "0x9", + "EventName": "DISPATCH_BLOCKED.ANY", + "SampleAfterValue": "200000", + "UMask": "0x20" + }, { "BriefDescription": "Divide operations retired", "Counter": "0,1", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A47B01AA791 for ; Sat, 22 Mar 2025 06:34:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625282; cv=none; b=Xry1b0XWu3OtrlBQGThS2lfxQ9hEC/rnsBd8kTYs9Zy1FHN0IP8xdHJRqGLDLPHMULBbALnksJE34dPr6fbJyQ8D+6eYF5+SEgRofOeGz5JQWD1JWR0KwyOac68HalTyNRyzSSrNiv9yredOyww/kbrpyhQoqpF8kdvgDxIdb+E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625282; c=relaxed/simple; bh=U6fdboGKnD4AJzFkvDZ3+eZqfEtKEmb9evjgAcZCmj4=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=hL/yIeQvrW3p5bUIYFbnYmc7n5rV2rwtE1a6BS5TAzxt9pDSa1ybXY7815MG2yxK6N+ilgLGrJtUqX6phHrNjcWlfOSd4DVupd6HIUxWoU0SDSEAQUA06eWykmAIecBmfz8dd54rEyzabWoBgO4hbnaUG0OrtONElESruUOgF/U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=TVW3y9j4; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="TVW3y9j4" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6feeba593dbso35347277b3.1 for ; Fri, 21 Mar 2025 23:34:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625275; x=1743230075; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=a5jocNXaBG8YnVIsRZJDaZAuXaovCUJdQml+Wnx7Beg=; b=TVW3y9j4IH7xtgT6CqV9Znx8tYNtN+RTBDIUhX6shbgKygVt8l1Pj9NIDqJgKekW+K sOXOv9qryxx2PZFjCwLXjpYCZMldZXpZAiQ98rSTPjr1+VzZKGECTiAxwvEn+flObc0m 
LQDS2E2vlvKWosUppL9bjcAq68ND3Aexob0ouiA9sbQh/57i3h82fS8IL1U72Jq0gGNf qWzZb7BHzgQ9O+H9hrrf9ZOjH/zk3l48OOZuw8U93PxEnUfmhSwPrP9BhPjoYkOcr0fy i20oh9zzOA0e5BdyqfcXhlc8eYN6REbwmGifcbLwnlZqbLM6nHrJXvSxhQ8ZQCxxj/JP L+Xw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625275; x=1743230075; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=a5jocNXaBG8YnVIsRZJDaZAuXaovCUJdQml+Wnx7Beg=; b=E/OofKrqaH9A/DMmOvIRAFArAkSeWc5GlXJivAae2geGB2l2dYK3xeBc2f5m11ozWt OTPxSwkLH606tfNY71N87HFMQGldpPaoSOMsCaUwjtoE1SpkjAZtKesjg/gDk0PmqyDR XbNRiTa2ZKKjwB4zWCowWgTrMmpDzwf7W0TFDXy13X9FAQ2PqTFUgLjNQue11LlpW902 4iid1CRKjOsk/Kf0tWnICkFlEzBKnRiMu40dPtedAesBYkUBPZKeqGl9A5LclnIBaqtD 8J1ij5LPhQbaBYsNsu4e+FJrmcf+UIhw4w1BX/8B2c+PqmpKccGCvU+LLagrdqHmZt7W mOGw== X-Forwarded-Encrypted: i=1; AJvYcCWg/348BPevRXXiSypAUNH0woF6D67trCcD5pGb46UWvQxC0m4C6L1MlY+C9H228cfyDthOvKSsZ0cSAc4=@vger.kernel.org X-Gm-Message-State: AOJu0YyKUHpX1/ybX/nFqJ0BbJtevUHHYJkwW0fqC17N4kthNTUbKvT9 DFv38k49ri03qLLmMFvvp1Nt5T3BCKiCN2iER0dCl7o9+XX27wZvEDahNTGPYmSlIlqfy0GjDHk Lc8O4MA== X-Google-Smtp-Source: AGHT+IEvSPBSSjqiJH5p4f9eVTUR0iu7MCe1Yh1Mn7bsAh4Kn/sTYB1TbTydYPYcllJ+rKFCi2k0rBdGW94k X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a81:b28a:0:b0:6fb:b2fb:575 with SMTP id 00721157ae682-700bad0aa03mr688237b3.7.1742625275327; Fri, 21 Mar 2025 23:34:35 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:33 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-6-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 05/35] perf vendor events: Update broadwell 
metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification. Signed-off-by: Ian Rogers --- .../arch/x86/broadwell/bdw-metrics.json | 256 +++++++++--------- 1 file changed, 127 insertions(+), 129 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json b/to= ols/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json index 40970fa5566c..89750117a7f6 100644 --- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -92,8 +92,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. 
Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, { @@ -104,7 +104,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. 
Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -114,7 +114,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { @@ -125,7 +125,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -133,8 +133,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -143,8 +143,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -152,7 +152,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Rel= ated metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches", "ScaleUnit": "100%" }, @@ -162,8 +162,8 @@ "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of 
cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= ", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { @@ -174,7 +174,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { @@ -183,8 +183,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. 
Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -192,8 +192,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.FPU_DIV_ACTIVE", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -202,8 +202,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. 
Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { @@ -212,7 +212,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -220,26 +220,26 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -247,18 +247,18 @@ "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM = / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Rela= ted metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -287,7 +287,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -295,8 +295,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -304,8 +304,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -313,8 +313,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -322,8 +322,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -333,33 +333,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -370,7 +370,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= 
U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -391,11 +391,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -420,7 +420,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -438,7 +438,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). 
Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -446,7 +446,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -454,7 +454,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -462,7 +462,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -470,7 +470,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -512,7 +512,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -634,20 +634,20 @@ }, { "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + c= pu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + cpu@DTLB_STORE_MISSES.= WALK_DURATION\\,cmask\\=3D0x1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DT= LB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_cor= e_core_clks", + "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu= @DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu@DTLB_STORE_MISSES.WALK= _DURATION\\,cmask\\=3D1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOA= D_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core= _clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": 
"Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -688,14 +688,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -743,7 +742,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -752,15 +751,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -786,14 +784,14 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, @@ -802,8 +800,8 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -811,8 +809,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { @@ -821,8 +819,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { @@ -831,8 +829,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", "ScaleUnit": "100%" }, { @@ -840,18 +838,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -870,8 +868,8 @@ "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -882,15 +880,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -899,7 +897,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -911,7 +909,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { @@ -928,8 +926,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -938,7 +936,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. 
For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { @@ -946,8 +944,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -956,7 +954,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -965,7 +963,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1001,7 +999,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1010,7 +1008,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1028,43 +1026,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1084,7 +1082,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1092,8 +1090,8 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1101,7 +1099,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1110,8 +1108,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1119,8 +1117,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1129,8 +1127,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1146,7 +1144,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1155,8 +1153,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. 
It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } ] --=20 2.49.0.395.g12beb8f557-goog
From nobody Thu Dec 18 13:40:49 2025 Date: Fri, 21 Mar 2025 23:33:34 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-7-irogers@google.com> X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 06/35] perf vendor events: Update broadwellde metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification.
Signed-off-by: Ian Rogers --- .../arch/x86/broadwellde/bdwde-metrics.json | 180 +++++++++--------- 1 file changed, 90 insertions(+), 90 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json = b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json index b03a5f2bcd82..81175f0f2603 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json @@ -74,7 +74,7 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "(tma_4k_aliasing > 0.2) & ((tma_l1_bound > 0.1= ) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, @@ -84,7 +84,7 @@ "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", - "MetricThreshold": "(tma_alu_op_utilization > 0.4)", + "MetricThreshold": "tma_alu_op_utilization > 0.4", "ScaleUnit": "100%" }, { @@ -92,7 +92,7 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "(tma_assists > 0.1) & ((tma_microcode_sequence= r > 0.05) & ((tma_heavy_operations > 0.1)))", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -102,7 +102,7 @@ "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma= _retiring)", "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound > 0.2)", + "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" @@ -112,7 +112,7 @@ "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * = (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)= ) / tma_info_thread_slots", "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" @@ -123,7 +123,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts > 0.1) & ((tma_bad_spe= culation > 0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" @@ -133,7 +133,7 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "(tma_branch_resteers > 0.05) & ((tma_fetch_lat= ency > 0.1) & ((tma_frontend_bound > 0.15)))", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, @@ -143,7 +143,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc > 0.1) & ((tma_microcode_sequencer >= 0.05) & ((tma_heavy_operations > 0.1)))", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, @@ -152,7 +152,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "(tma_clears_resteers > 0.05) & ((tma_branch_re= steers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))= ))", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, @@ -162,7 +162,7 @@ "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "(tma_contested_accesses > 0.05) & ((tma_l3_bou= nd > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", "ScaleUnit": "100%" }, @@ -172,7 +172,7 @@ "MetricExpr": "tma_backend_bound - tma_memory_bound", "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound > 0.1) & ((tma_backend_bound >= 0.2))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" @@ -183,7 +183,7 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "(tma_data_sharing > 0.05) & ((tma_l3_bound > 0= .05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. 
Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, @@ -192,7 +192,7 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "(tma_divider > 0.2) & ((tma_core_bound > 0.1) = & ((tma_backend_bound > 0.2)))", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%" }, @@ -202,7 +202,7 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "(tma_dram_bound > 0.1) & ((tma_memory_bound > = 0.2) & ((tma_backend_bound > 0.2)))", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -211,7 +211,7 @@ "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4= _UOPS) / tma_info_core_core_clks / 2", "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", - "MetricThreshold": "(tma_dsb > 0.15) & ((tma_fetch_bandwidth > 0.2= ))", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, @@ -220,7 +220,7 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "(tma_dsb_switches > 0.05) & ((tma_fetch_latenc= y > 0.1) & ((tma_frontend_bound > 0.15)))", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_cove= rage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, @@ -229,7 +229,7 @@ "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "(tma_dtlb_load > 0.1) & ((tma_l1_bound > 0.1) = & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store", "ScaleUnit": "100%" }, @@ -238,7 +238,7 @@ "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "(tma_dtlb_store > 0.05) & ((tma_store_bound > = 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, @@ -248,7 +248,7 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", - "MetricThreshold": "(tma_fb_full > 0.3)", + "MetricThreshold": "tma_fb_full > 0.3", "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, @@ -257,7 +257,7 @@ "MetricExpr": "tma_frontend_bound - tma_fetch_latency", "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", "MetricName": "tma_fetch_bandwidth", - "MetricThreshold": "(tma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" @@ -267,7 +267,7 @@ "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE= / tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", - "MetricThreshold": "(tma_fetch_latency > 0.1) & ((tma_frontend_bou= nd > 0.15))", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" @@ -277,7 +277,7 @@ "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", - "MetricThreshold": "(tma_fp_arith > 0.2) & ((tma_light_operations = > 0.6))", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, @@ -286,7 +286,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "(tma_fp_scalar > 0.1) & ((tma_fp_arith > 0.2) = & ((tma_light_operations > 0.6)))", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -295,7 +295,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "(tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) = & ((tma_light_operations > 0.6)))", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -304,7 +304,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "(tma_fp_vector_128b > 0.1) & ((tma_fp_vector >= 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -313,7 +313,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "(tma_fp_vector_256b > 0.1) & ((tma_fp_vector >= 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -322,7 +322,7 @@ "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_thread_slots= ", "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. 
Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" @@ -332,7 +332,7 @@ "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", - "MetricThreshold": "(tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%" @@ -342,7 +342,7 @@ "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses > 0.05) & ((tma_fetch_laten= cy > 0.1) & ((tma_frontend_bound > 0.15)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
         "ScaleUnit": "100%"
     },
@@ -351,14 +351,14 @@
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "(tma_info_bad_spec_ipmisp_indirect < 1000)"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;BadSpec;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmispredict",
-        "MetricThreshold": "(tma_info_bad_spec_ipmispredict < 200)"
+        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
@@ -396,7 +396,7 @@
         "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)",
         "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_frontend_dsb_coverage",
-        "MetricThreshold": "(tma_info_frontend_dsb_coverage < 0.7) & ((tma_info_thread_ipc / 4) > 0.35)",
+        "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_info_thread_ipc / 4 > 0.35",
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache).
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
@@ -429,7 +429,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
-        "MetricThreshold": "(tma_info_inst_mix_iparith < 10)",
+        "MetricThreshold": "tma_info_inst_mix_iparith < 10",
         "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
     },
     {
@@ -437,7 +437,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
-        "MetricThreshold": "(tma_info_inst_mix_iparith_avx128 < 10)",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
         "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -445,7 +445,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
-        "MetricThreshold": "(tma_info_inst_mix_iparith_avx256 < 10)",
+        "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
         "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -453,7 +453,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
-        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_dp < 10)",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -461,7 +461,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
-        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_sp < 10)",
+        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -469,42 +469,42 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Branches;Fed;InsType",
         "MetricName": "tma_info_inst_mix_ipbranch",
-        "MetricThreshold": "(tma_info_inst_mix_ipbranch < 8)"
+        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8"
     },
     {
         "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_ipcall",
-        "MetricThreshold": "(tma_info_inst_mix_ipcall < 200)"
+        "MetricThreshold": "tma_info_inst_mix_ipcall < 200"
     },
     {
         "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_ipflop",
-        "MetricThreshold": "(tma_info_inst_mix_ipflop < 10)"
+        "MetricThreshold": "tma_info_inst_mix_ipflop < 10"
     },
     {
         "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipload",
-        "MetricThreshold": "(tma_info_inst_mix_ipload < 3)"
+        "MetricThreshold": "tma_info_inst_mix_ipload < 3"
     },
     {
         "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipstore",
-        "MetricThreshold": "(tma_info_inst_mix_ipstore < 8)"
+        "MetricThreshold": "tma_info_inst_mix_ipstore < 8"
     },
     {
         "BriefDescription": "Instructions per taken branch",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "(tma_info_inst_mix_iptb < 4 * 2 + 1)",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -629,7 +629,7 @@
         "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core_clks",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_page_walks_utilization",
-        "MetricThreshold": "(tma_info_memory_tlb_page_walks_utilization > 0.5)"
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
     },
     {
         "BriefDescription": "",
@@ -680,7 +680,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "(tma_info_system_ipfarbranch < 1000000)"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
@@ -693,14 +693,14 @@
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "OS",
         "MetricName": "tma_info_system_kernel_utilization",
-        "MetricThreshold": "(tma_info_system_kernel_utilization > 0.05)"
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
     },
     {
         "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Summary",
         "MetricName": "tma_info_system_mux",
-        "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_mux < 0.9))"
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
     },
     {
         "BriefDescription": "Total package
Power in Watts",
@@ -725,7 +725,7 @@
         "MetricExpr": "duration_time",
         "MetricGroup": "Summary",
         "MetricName": "tma_info_system_time",
-        "MetricThreshold": "(tma_info_system_time < 1)"
+        "MetricThreshold": "tma_info_system_time < 1"
     },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
@@ -769,21 +769,21 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;Ret;Retire",
         "MetricName": "tma_info_thread_uoppi",
-        "MetricThreshold": "(tma_info_thread_uoppi > 1.05)"
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05"
     },
     {
         "BriefDescription": "Uops per taken branch",
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "(tma_info_thread_uptb < 4 * 1.5)"
+        "MetricThreshold": "tma_info_thread_uptb < 6"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "(tma_itlb_misses > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.
Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
         "ScaleUnit": "100%"
     },
@@ -792,7 +792,7 @@
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "(tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
@@ -801,7 +801,7 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "(tma_l2_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.
Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
@@ -811,7 +811,7 @@
         "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "(tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
         "ScaleUnit": "100%"
     },
@@ -821,7 +821,7 @@
         "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "(tma_l3_hit_latency > 0.1) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance.
Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency",
         "ScaleUnit": "100%"
     },
@@ -830,7 +830,7 @@
         "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "(tma_lcp > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
+        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
         "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
@@ -839,7 +839,7 @@
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
-        "MetricThreshold": "(tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .).
Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
@@ -850,7 +850,7 @@
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_load_op_utilization",
-        "MetricThreshold": "(tma_load_op_utilization > 0.6)",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
         "ScaleUnit": "100%"
     },
@@ -860,7 +860,7 @@
         "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_thread_clks",
         "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "(tma_lock_latency > 0.2) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS.
Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -870,7 +870,7 @@
         "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts",
         "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
         "MetricName": "tma_machine_clears",
-        "MetricThreshold": "(tma_machine_clears > 0.1) & ((tma_bad_speculation > 0.15))",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
@@ -880,7 +880,7 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
         "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "(tma_mem_bandwidth > 0.2) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores.
This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
@@ -889,7 +889,7 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "(tma_mem_latency > 0.1) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that).
Related metrics: tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
@@ -899,7 +899,7 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound",
         "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
-        "MetricThreshold": "(tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
         "ScaleUnit": "100%"
@@ -909,7 +909,7 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_thread_slots",
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
-        "MetricThreshold": "(tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1))",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit.
The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -918,7 +918,7 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "(tma_mispredicts_resteers > 0.05) & ((tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))))",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
@@ -927,7 +927,7 @@
         "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_mite",
-        "MetricThreshold": "(tma_mite > 0.1) & ((tma_fetch_bandwidth > 0.2))",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD.
For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
         "ScaleUnit": "100%"
     },
@@ -936,7 +936,7 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "(tma_ms_switches > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
+        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
         "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES.
Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
@@ -945,7 +945,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_core_clks",
         "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_0",
-        "MetricThreshold": "(tma_port_0 > 0.6)",
+        "MetricThreshold": "tma_port_0 > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -954,7 +954,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_1",
-        "MetricThreshold": "(tma_port_1 > 0.6)",
+        "MetricThreshold": "tma_port_1 > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU).
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -963,7 +963,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group",
         "MetricName": "tma_port_2",
-        "MetricThreshold": "(tma_port_2 > 0.6)",
+        "MetricThreshold": "tma_port_2 > 0.6",
         "ScaleUnit": "100%"
     },
     {
@@ -971,7 +971,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_group",
         "MetricName": "tma_port_3",
-        "MetricThreshold": "(tma_port_3 > 0.6)",
+        "MetricThreshold": "tma_port_3 > 0.6",
         "ScaleUnit": "100%"
     },
     {
@@ -979,7 +979,7 @@
         "MetricExpr": "tma_store_op_utilization",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_utilization_group",
         "MetricName": "tma_port_4",
-        "MetricThreshold": "(tma_port_4 > 0.6)",
+        "MetricThreshold": "tma_port_4 > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 4 (Store-data). Related metrics: tma_split_stores",
         "ScaleUnit": "100%"
     },
@@ -988,7 +988,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_5",
-        "MetricThreshold": "(tma_port_5 > 0.6)",
+        "MetricThreshold": "tma_port_5 > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU).
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -997,7 +997,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
-        "MetricThreshold": "(tma_port_6 > 0.6)",
+        "MetricThreshold": "tma_port_6 > 0.6",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -1006,7 +1006,7 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_group",
         "MetricName": "tma_port_7",
-        "MetricThreshold": "(tma_port_7 > 0.6)",
+        "MetricThreshold": "tma_port_7 > 0.6",
         "ScaleUnit": "100%"
     },
     {
@@ -1015,7 +1015,7 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_ports_utilization",
-        "MetricThreshold": "(tma_ports_utilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non
divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
         "ScaleUnit": "100%"
     },
@@ -1024,7 +1024,7 @@
         "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_0",
-        "MetricThreshold": "(tma_ports_utilized_0 > 0.2) & ((tma_ports_utilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise).
Long-latency instructions like divides may contribute to this metric.",
         "ScaleUnit": "100%"
     },
@@ -1033,7 +1033,7 @@
         "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks)",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_1",
-        "MetricThreshold": "(tma_ports_utilized_1 > 0.2) & ((tma_ports_utilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL.
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1042,7 +1042,7 @@ "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "(tma_ports_utilized_2 > 0.15) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1051,7 +1051,7 @@ "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "(tma_ports_utilized_3m > 0.4) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, @@ -1060,7 +1060,7 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_thread_slots", "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "((tma_retiring > 0.7)|(tma_heavy_operations > = 0.1))", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. 
They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" @@ -1071,7 +1071,7 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "(tma_split_loads > 0.3)", + "MetricThreshold": "tma_split_loads > 0.3", "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, @@ -1080,7 +1080,7 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "(tma_split_stores > 0.2) & ((tma_store_bound >= 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, @@ -1089,7 +1089,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "(tma_sq_full > 0.3) & ((tma_l3_bound > 0.05) &= ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1098,7 +1098,7 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "(tma_store_bound > 0.2) & ((tma_memory_bound >= 0.2) & ((tma_backend_bound > 0.2)))", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, @@ -1107,7 +1107,7 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "(tma_store_fwd_blk > 0.1) & ((tma_l1_bound > 0= .1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, @@ -1117,7 +1117,7 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "(tma_store_latency > 0.1) & ((tma_store_bound = > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. 
Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, @@ -1126,7 +1126,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", - "MetricThreshold": "(tma_store_op_utilization > 0.6)", + "MetricThreshold": "tma_store_op_utilization > 0.6", "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, @@ -1135,7 +1135,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "(tma_unknown_branches > 0.05) & ((tma_branch_r= esteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)= )))", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -1144,7 +1144,7 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "(tma_x87_use > 0.1) & ((tma_fp_arith > 0.2) & = ((tma_light_operations > 0.6)))", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E3F61C5D7D for ; Sat, 22 Mar 2025 06:34:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625287; cv=none; b=MFkn9PfFr8NcL2yZMhtLH9xWp5e7mkxDxgqC1sGpTpUHaJIaLiepJ5wNrdpfCWPyRw0s8JxlDQDVs+PS2g5heC+kF9E3uUQhGm8fAPe+1CFrXuT3JpSSTDSw8XP0omYvY4k5d0yVjdy78+qiMK3d/wVMzjLMCjbNeNryPisHwgU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625287; c=relaxed/simple; bh=WsuVbqWh+bJy53QzNA91ImR6ThKQ4xu7bsPgOfOgj+U=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=DpbD5QCxKK7Lds8D6SpdHEhIKPpC2m3inxUAoWxXGc5IQGza9D0nF9/n3dEhQQQF7uywaTBNkmYVjKdcXbkTRXtUyemvWtZcDdZ/UguS1nVsF0j0S9Ykwrt6+/TqlYx9Dmi/JeJvLkO5c5yZXOA0asVyZhKI2RpQ1LC7L4GHAic= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass 
(p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=XCLTuQY9; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="XCLTuQY9" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6fedcc61536so52224777b3.0 for ; Fri, 21 Mar 2025 23:34:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625280; x=1743230080; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=J5Ova8GyhEIII8UmDuNM8qm3xriqxYZGD7PNX7po890=; b=XCLTuQY9M8AAx8wjF5/GXWB5ez9Qwn0DRsTb3/7exzz8EIDWuOdYdZsGhygkVOHdds r2MDYzcSmAN9xpgOGJCzscm6EsghSV2N1VPjrCXF6RwQh2zRSwAdglUqiRyWMxp7nUV6 KfBxIA9vsXWfPFcuT3W3IqimcZRdJXzMhb8294JRCp2RtUUtu8MN1l+UyPD2clpakmgV AZ6UkGmqWXu68PID/CYB4fTx6LHUYmJvepfoTAcKGNKaVBRCjwXi38y6eHWhEN/RQ01a 9Cm4UKD8zW9EdeNwI/2/9K1leN8l/QApe6DqHyerfJVgWiyWyik3+SpQM2g/MfeAuJ/k ohqQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625280; x=1743230080; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=J5Ova8GyhEIII8UmDuNM8qm3xriqxYZGD7PNX7po890=; b=mqT1m7a49Rjk5qneBMGzsuZF7GegaQh9bUlRLnrQPqOrgrO0tgP/ziA2VXG9dx5YOV GyljuiuI3ly3gwd3Noe13Mi6HHnm+PfcerOruSBuS8gm+6DDsdQun7I4MkYcMFhZv/gV LHOwbbhHm+mmXeISZSZkpwFEKWX0JMF9WapigOhQHiXeg/wT/fQI8XttWRAyRHbd2z42 InaemPaSw1+3C9fS7ebLB/cPYjLqakc5ZDX9RbkJUK8TEhaaJz7A9Gvq8cYPVY0RqsjT 
+L2OhmPcWkt9BD42ciUKH2QqBrchu3vL0unQwI7p9Y3MJsnu950m78z5qDF90LV72aWO Cyhg== X-Forwarded-Encrypted: i=1; AJvYcCVuIodtwTwDJzjaRMyA0qzeRCm5kIqQbDFJ9lL8q7ZU6pJuhmwSminaj2+EMnUxQHJHzrRTZsMQ/xWPXNE=@vger.kernel.org X-Gm-Message-State: AOJu0YwGjgn6yjra6Dh7QQPqFfNUtCoaBNfjmBFavobW0Ux793Dql4BI QGXnIMkH76SQM9MavyKPkO6j55JNJhBEu7Qc2B1l83Qd/cOQYXePmQm4ZPwkCHVQ6Vp6bRbVPvr eyiEMqQ== X-Google-Smtp-Source: AGHT+IEMm3uvd84hOgt2pqEJykKC904DqPykomjqgCTB/orY7S8AJKwZ/+Sv8wov7mAomKI/VW5xCk1CXkj6 X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:6902:4ea:b0:e5e:1496:7371 with SMTP id 3f1490d57ef6-e66909c0f27mr18192276.0.1742625280246; Fri, 21 Mar 2025 23:34:40 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:35 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-8-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 07/35] perf vendor events: Update broadwellx metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification. 
Signed-off-by: Ian Rogers --- .../arch/x86/broadwellx/bdx-metrics.json | 268 +++++++++--------- 1 file changed, 133 insertions(+), 135 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json b/t= ools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json index 8016202bad1f..5d06a3f72be2 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json @@ -276,12 +276,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -294,8 +294,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. 
Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, { @@ -306,7 +306,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. 
Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -316,7 +316,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { @@ -327,7 +327,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. 
These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -335,8 +335,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -345,8 +345,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -354,7 +354,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Rel= ated metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches", "ScaleUnit": "100%" }, @@ -364,8 +364,8 @@ "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_D= RAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RET= IRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + ME= M_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS= _RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_= HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_U= OPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM = + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED= .REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": 
"BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= , tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { @@ -376,7 +376,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. 
Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { @@ -385,8 +385,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRA= M + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIR= ED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. 
Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -394,8 +394,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.FPU_DIV_ACTIVE", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -404,8 +404,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { @@ -414,7 +414,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -422,26 +422,26 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. 
This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -449,18 +449,18 @@ "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_= HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info= _thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_= RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LL= C_MISS.REMOTE_HITM. Related metrics: tma_contested_accesses, tma_data_shari= ng, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -489,7 +489,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -497,8 +497,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -506,8 +506,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -515,8 +515,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -524,8 +524,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -535,33 +535,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -572,7 +572,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= 
U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -593,11 +593,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -622,7 +622,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -640,7 +640,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). 
Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -648,7 +648,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -656,7 +656,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -664,7 +664,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -672,7 +672,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -714,7 +714,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -842,14 +842,14 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -890,14 +890,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -908,14 +907,14 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - 
"MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" @@ -965,7 +964,7 @@ "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -974,15 +973,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1008,14 +1006,14 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, @@ -1024,8 +1022,8 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -1033,8 +1031,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { @@ -1043,8 +1041,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { @@ -1053,8 +1051,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1062,18 +1060,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1091,8 +1089,8 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { @@ -1101,8 +1099,8 @@ "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -1118,10 +1116,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -1130,7 +1128,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -1142,7 +1140,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { @@ -1159,8 +1157,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1169,7 +1167,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. 
For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { @@ -1177,8 +1175,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1187,7 +1185,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1196,7 +1194,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1232,7 +1230,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1241,7 +1239,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1259,43 +1257,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1304,8 +1302,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * 
(MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1313,8 +1311,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { @@ -1334,7 +1332,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1342,8 +1340,8 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1351,7 +1349,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1360,8 +1358,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1369,8 +1367,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1379,8 +1377,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1396,7 +1394,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1405,8 +1403,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. 
It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2B9CA1C84C7 for ; Sat, 22 Mar 2025 06:34:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625294; cv=none; b=QRm1TsVLRHrAcAtrkh8dp0Gxzqmi3E24gQC/av9aI28kdXm6lXSxx3mKpoHfbWRypPl7YVpbMRrNVgKHcD1lrQ3CtxUMF8K5D6h/bO1DsgJVomxk5+jPMe+X1fVyUjVrADfd/dRD54lbyHeMEa7CxnzZyB+C32bwsqd7sx2TYeA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625294; c=relaxed/simple; bh=EGxdMtxUGHyCcu+Hd0Y6QDl5Ws+H7DHDAF99GmO0dnc=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=kAtFlyr+fmyjUsfMvCTGg6finpLFnFfR4CFFlLhrC5wtgcHfrv49m5MVFdtWAlSdwp+PKIwE2qFE+OQNRaSy14Zh3J7TXuU//GyCIdRDxZ2HmbrUkdbnEpoF/pG5W1jwPoSUaqB1uclbSzEzPmrwAFksTqhYIbxxJUoMiXZ8tQw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=SXwZSk65; 
arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="SXwZSk65" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6fed889e351so31235847b3.3 for ; Fri, 21 Mar 2025 23:34:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625283; x=1743230083; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=n6Q8Z4WlSxJWH025xs6NtydM/dPbwQWyqyzyunuITcw=; b=SXwZSk65GO7yr6uXRh1MqRh/BTE5pvboTS42dVjo1WrYRp1jWGiSnJmaYxnZBcjw0m kk0SaVB1AI1Z3HWpDznJtnp1PeGL1Pgugf5mYg6NyOUN5eI+4Sabu3L2qeTIgQcsjceG yzf+yl4j4h+y8ToLch1KRpvTX2eEi2k7OkPUVdD9Bk75SUD8bqaoUQs7yvyKV5mgKcgb G9XlZFuJrNArdV5RIz9apg+vPHieuz7QF04mckE2gBQmNzcYeQwuUEUsz7ODR+tBVPWJ 9JhDBjChejz5TgDj8ZquvDyAHWe9ab3wbdFUHrzXqUCMtPR6a1TvNRDTrgVHDlbzn8qR D5MA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625283; x=1743230083; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=n6Q8Z4WlSxJWH025xs6NtydM/dPbwQWyqyzyunuITcw=; b=UCiLwSDD0mnJv0sw9hevMmZlcjO7qi2/I370Ag+9Gn4z71CGS7lvEhLDLmGVKzoYOh rEv9+VM8tMI+uXs+XxunIEeHx9sYopCt/Z7IOSVQzAKl6CNIqvopGAhG/1spktmU5eYD /WuUb21bSuBVwXGpkHGAQwp1hobgK4T77lXS3GbXP40L00FnbcTFe28z/YUBsIQjNTgn zrogTHhxk0rey9lg0dKQxPOFT8sg7WQRctehAI5y7ZU5qSBUAkv7Z6mE+vjJ0yuo5v+4 pb7OemFNGwl7Kdf9yy2jl/U+JzI2ZBpLfk6GurgOW5D3J5MSJZTZ+jG7+U7uhxwqR6YE qpQw== X-Forwarded-Encrypted: i=1; 
AJvYcCWgmcvgewwqwmKYMctKUKDOpF6OjqVthGwmtGXkiLj+Oku+aJ/uuLNJdy1GRH9ntBEDabDUznqad9DoMdA=@vger.kernel.org X-Gm-Message-State: AOJu0YxfKYbvXAO1/UOggL2Kkc37Nvc5ZiVtgs2Ew6T04kTiWN2Ly+7A FgtuV7v2+2g60sM8Mh7H3dQ1JhedbWoflxGO6hoSD4u5lA4hip6S+tIQ9w4EqkR6rQVstimzj2c yERwBPA== X-Google-Smtp-Source: AGHT+IF5xKoQths+gQTaMCLpdxT1VwcRPthMs2d8rpT8zRRRM9zqFPuNzZVpqw8IuTIUaz+xe1K3uSSwbcuT X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a0d:d2c7:0:b0:6eb:ac7:b4bc with SMTP id 00721157ae682-700babeb802mr239577b3.2.1742625282819; Fri, 21 Mar 2025 23:34:42 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:36 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-9-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 08/35] perf vendor events: Update cascadelakex events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. 
Signed-off-by: Ian Rogers --- .../arch/x86/cascadelakex/cache.json | 404 ++++++++++++++++++ .../arch/x86/cascadelakex/clx-metrics.json | 389 +++++++++-------- .../arch/x86/cascadelakex/other.json | 404 ------------------ 3 files changed, 598 insertions(+), 599 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json b/tools= /perf/pmu-events/arch/x86/cascadelakex/cache.json index 8bad700ff8ea..d113c14aa7c9 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json @@ -1,4 +1,78 @@ [ + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", + "SampleAfterValue": "2000003", + "UMask": "0x20" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDM", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDM", + "SampleAfterValue": "2000003", + "UMask": "0x10" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITI", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITI", + "SampleAfterValue": "2000003", + "UMask": "0x1" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", + "SampleAfterValue": "2000003", + "UMask": "0x40" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDM", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDM", + "SampleAfterValue": "2000003", + "UMask": "0x8" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": 
"CORE_SNOOP_RESPONSE.RSP_SHITFSE", + "SampleAfterValue": "2000003", + "UMask": "0x4" + }, + { + "BriefDescription": "Counts number of cache lines that are dropped= and not written back to L3 as they are deemed to be less likely to be reus= ed shortly", + "Counter": "0,1,2,3", + "EventCode": "0xFE", + "EventName": "IDI_MISC.WB_DOWNGRADE", + "PublicDescription": "Counts number of cache lines that are droppe= d and not written back to L3 as they are deemed to be less likely to be reu= sed shortly.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, + { + "BriefDescription": "Counts number of cache lines that are allocat= ed and written back to L3 with the intention that they are more likely to b= e reused shortly", + "Counter": "0,1,2,3", + "EventCode": "0xFE", + "EventName": "IDI_MISC.WB_UPGRADE", + "PublicDescription": "Counts number of cache lines that are alloca= ted and written back to L3 with the intention that they are more likely to = be reused shortly.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, { "BriefDescription": "L1D data line replacements", "Counter": "0,1,2,3", @@ -2343,6 +2417,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all demand code reads have any respons= e type.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.L3_HIT.ANY_SNOOP OCR.DEMAND_CODE_RD.L3_HIT.ANY_SNOOP", "Counter": "0,1,2,3", @@ -2703,6 +2787,116 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80400004", + 
"SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80400004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDE= D", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HITM_OTHER_CORE", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HITM_OTHER_CORE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1000020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_F= 
WD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.SNOOP_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x200020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand code reads", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80020004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads have any response ty= pe.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.L= 3_HIT.ANY_SNOOP OCR.DEMAND_DATA_RD.L3_HIT.ANY_SNOOP", "Counter": "0,1,2,3", @@ -3063,6 +3257,116 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80400001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.SNOOP_NONE", + "Counter": "0,1,2,3", + 
"EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80400001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDE= D", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HITM_OTHER_CORE", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HITM_OTHER_CORE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1000020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_F= WD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= 
SUPPLIER_NONE.NO_SNOOP_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.SNOOP_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x200020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80020001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) have any= response type.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.L3_HIT.ANY_SNOOP OCR.DEMAND_RFO.L3_HIT.ANY_SNOOP", "Counter": "0,1,2,3", @@ -3423,6 +3727,106 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80400002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NONE", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80400002", + "SampleAfterValue": "100003", 
+ "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.ANY_SNOOP", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.ANY_SNOOP", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F80020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HITM_OTHER_CORE", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HITM_OTHER_CORE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1000020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.NO_SNOOP_NEEDED", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.NO_SNOOP_NEEDED", + "MSRIndex": 
"0x1a6,0x1a7", + "MSRValue": "0x100020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs)", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.SNOOP_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x200020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all demand data writes (RFOs)", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.SNOOP_NONE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x80020002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts any other requests OCR.OTHER.L3_HIT.AN= Y_SNOOP OCR.OTHER.L3_HIT.ANY_SNOOP", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b= /tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json index 5729b93a9c68..6485b565acbc 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json @@ -313,12 +313,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -330,7 +330,7 @@ "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info= _thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. 
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, @@ -341,7 +341,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. 
Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -351,12 +351,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -371,7 +371,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= 
l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -379,7 +379,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + 
tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_= store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores += tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= oads + tma_store_fwd_blk)) + tma_memory_bound * 
(tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency)) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -387,22 +387,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_b= ranches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tm= a_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": 
"tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers = + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_i= tlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / 
tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -410,7 +410,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_spli= t_stores + tma_store_latency)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -418,7 +418,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and 
coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", @@ -426,7 +426,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * 
tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -438,10 +438,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -463,8 +463,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -472,8 +472,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -481,7 +481,7 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, @@ -490,7 +490,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -498,33 +498,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 
0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": 
"100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DE= MAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER= _CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + (47.5 * tma_info_= system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3= _HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L= 1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.= DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER= _CORE_FWD))) + 44 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRE= D.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)= / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -535,25 +535,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE /= (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT= _OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_= DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE= + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. 
Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -562,7 +562,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -572,7 +572,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -582,7 +582,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -590,27 +590,27 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -619,18 +619,18 @@ "MetricExpr": "(110 * tma_info_system_core_frequency * (OCR.DEMAND= _RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_= info_system_core_frequency * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.P= F_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM, OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sha= ring, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tm= a_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -640,7 +640,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -650,7 +650,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -670,7 +670,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -679,7 +679,7 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { @@ -687,17 +687,17 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xFC@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -706,7 +706,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. 
May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -715,7 +715,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -724,7 +724,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -735,35 +735,35 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -774,11 +774,11 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -803,7 +803,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -812,7 +812,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + 
tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -820,10 +820,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -852,7 +853,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -877,14 +878,14 @@ }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@) / (2 * tma_info_core_core_clks= )", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) / (2 * tma_info_core_core_clks= )", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." 
}, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -897,20 +898,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -946,7 +947,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": 
"Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -961,11 +962,11 @@ { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -973,7 +974,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -981,7 +982,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -989,7 +990,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -997,7 +998,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1005,7 +1006,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1061,7 +1062,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1248,8 +1249,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1270,12 +1271,12 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1331,14 +1332,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1356,7 +1356,7 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" @@ -1386,7 +1386,7 @@ "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_= core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_cor= e_clks)", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1394,7 +1394,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1402,7 +1402,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1436,7 +1436,7 @@ "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1445,15 +1445,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1479,15 +1478,15 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1495,7 +1494,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1504,17 +1503,17 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1523,7 +1522,7 @@ "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1532,17 +1531,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1550,18 +1549,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1579,7 +1578,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1587,39 +1586,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "(80 * tma_info_system_core_frequency - 20.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1628,7 +1627,7 @@ "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1645,10 +1644,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1657,7 +1656,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1669,11 +1668,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1695,7 +1694,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1709,12 +1708,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1722,7 +1721,7 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1732,7 +1731,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1740,8 +1739,8 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP_PS", "ScaleUnit": "100%" }, { @@ -1754,19 +1753,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1775,7 +1774,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es 
CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1784,7 +1783,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1820,7 +1819,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. 
Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1829,7 +1828,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1846,8 +1845,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1855,8 +1854,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1864,7 +1863,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1873,35 +1872,35 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 = * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) *= MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_= LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", 
"MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tm= a_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 += MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_= clks", + "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. 
Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { @@ -1919,7 +1918,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1928,8 +1927,8 @@ "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: ROB_MISC_EVENTS.P= AUSE_INST", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, { @@ -1939,7 +1938,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1947,8 +1946,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1956,7 +1955,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1965,8 +1964,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1974,8 +1973,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1984,8 +1983,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2001,7 +2000,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -2009,31 +2008,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory 
paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache 
translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2041,7 +2040,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -2050,8 +2049,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/other.json b/tools= /perf/pmu-events/arch/x86/cascadelakex/other.json index f25693b17b8b..51833bce994e 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/other.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/other.json @@ -35,62 +35,6 @@ "SampleAfterValue": "200003", "UMask": "0x40" }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", - "SampleAfterValue": "2000003", - "UMask": "0x20" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDM", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDM", - "SampleAfterValue": "2000003", - "UMask": "0x10" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", - "SampleAfterValue": "2000003", - 
"UMask": "0x2" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITI", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITI", - "SampleAfterValue": "2000003", - "UMask": "0x1" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", - "SampleAfterValue": "2000003", - "UMask": "0x40" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDM", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDM", - "SampleAfterValue": "2000003", - "UMask": "0x8" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", - "SampleAfterValue": "2000003", - "UMask": "0x4" - }, { "BriefDescription": "Number of hardware interrupts received by the= processor.", "Counter": "0,1,2,3", @@ -100,24 +44,6 @@ "SampleAfterValue": "203", "UMask": "0x1" }, - { - "BriefDescription": "Counts number of cache lines that are dropped= and not written back to L3 as they are deemed to be less likely to be reus= ed shortly", - "Counter": "0,1,2,3", - "EventCode": "0xFE", - "EventName": "IDI_MISC.WB_DOWNGRADE", - "PublicDescription": "Counts number of cache lines that are droppe= d and not written back to L3 as they are deemed to be less likely to be reu= sed shortly.", - "SampleAfterValue": "100003", - "UMask": "0x4" - }, - { - "BriefDescription": "Counts number of cache lines that are allocat= ed and written back to L3 with the intention that they are more likely to b= e reused shortly", - "Counter": "0,1,2,3", - "EventCode": "0xFE", - "EventName": "IDI_MISC.WB_UPGRADE", - "PublicDescription": "Counts number of cache lines that are alloca= ted and written back to L3 with the intention that they are more likely to = be reused shortly.", - "SampleAfterValue": "100003", - "UMask": "0x2" - }, { "BriefDescription": 
"OCR.ALL_DATA_RD.ANY_RESPONSE have any respons= e type.", "Counter": "0,1,2,3", @@ -668,336 +594,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all demand code reads have any respons= e type.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80400004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80400004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE_= RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDE= D", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HITM_OTHER_CORE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": 
"OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HITM_OTHER_CORE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1000020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x800020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_F= WD", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x400020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads OCR.DEMAND_CODE= _RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.SNOOP_MISS", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x200020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand code reads", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SUPPLIER_NONE.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80020004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads have any response ty= pe.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - 
"MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80400001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.SNOOP_NONE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80400001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.P= MM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDE= D", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HITM_OTHER_CORE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HITM_OTHER_CORE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1000020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - 
"MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x800020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.HIT_OTHER_CORE_NO_F= WD", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x400020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads OCR.DEMAND_DATA_RD.= SUPPLIER_NONE.NO_SNOOP_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.NO_SNOOP_NEEDED", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.SNOOP_MISS", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x200020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SUPPLIER_NONE.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80020001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) have any= response type.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80400002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { 
- "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NONE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80400002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEMA= ND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.PMM_HIT_LOCAL_PMM.SNOOP_NOT_NEEDED", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.ANY_SNOOP", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.ANY_SNOOP", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F80020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HITM_OTHER_CORE", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HITM_OTHER_CORE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1000020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_FWD", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x800020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.HIT_OTHER_CORE_NO_FWD", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": 
"0x400020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs) OCR.DEM= AND_RFO.SUPPLIER_NONE.NO_SNOOP_NEEDED", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.NO_SNOOP_NEEDED", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs)", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.SNOOP_MISS", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x200020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all demand data writes (RFOs)", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SUPPLIER_NONE.SNOOP_NONE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x80020002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts any other requests have any response t= ype.", "Counter": "0,1,2,3", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B951D1C84D6 for ; Sat, 22 Mar 2025 06:34:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625288; cv=none; b=NZL3yergvBDa/mj6NCjwX1dPKavJ//4mCrQw5ZQ4OLxV2sOFY1DfH8kO7+nlMnopvVxoXqur5hYT7OiH5dn9cNmZrbAIfh/DyLTqhoJ18AHGeHA8DC6vBSal+v7wLKz3aFXN1uvFNluRy15RZGaT4EE7torLLBLkFK+hmM/lYug= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625288; c=relaxed/simple; bh=jj7cgVY1dHCdbl7CToIOpx+DtAOgzXi+ZwBseYtASCc=; 
Date: Fri, 21 Mar 2025 23:33:37 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-10-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 09/35] perf vendor events: Update clearwaterforest events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update event topic of
OCR.DEMAND_DATA_RD.ANY_RESPONSE and OCR.DEMAND_RFO.ANY_RESPONSE to be cache. Signed-off-by: Ian Rogers --- .../arch/x86/clearwaterforest/cache.json | 20 +++++++++++++++++ .../arch/x86/clearwaterforest/other.json | 22 ------------------- 2 files changed, 20 insertions(+), 22 deletions(-) delete mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/other.j= son diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/cache.json b/t= ools/perf/pmu-events/arch/x86/clearwaterforest/cache.json index 875361b30f1d..17f8bfba56bc 100644 --- a/tools/perf/pmu-events/arch/x86/clearwaterforest/cache.json +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/cache.json @@ -140,5 +140,25 @@ "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", "SampleAfterValue": "1000003", "UMask": "0x6" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" } ] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/other.json b/t= ools/perf/pmu-events/arch/x86/clearwaterforest/other.json deleted file mode 100644 index 80454e497f83..000000000000 --- a/tools/perf/pmu-events/arch/x86/clearwaterforest/other.json +++ /dev/null @@ -1,22 +0,0 @@ -[ - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - 
"MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - } -] --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 180601C84D5 for ; Sat, 22 Mar 2025 06:34:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625292; cv=none; b=JkcuyRc8UHzuchWOjz4gCxGwpoex/CodrG8Otd0fQCaAoJQ88f7ATlPZAYTnXit4cRw0WTrNwoSrfMYv0IszER+EmF+t3jl3WndWOgixx6s/sYCDhe/lwfv3Q6b8Dl6fwNA9Kobc8fTzZ4uIwB9gVur7gmrMdVspqlqwsGb63KY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625292; c=relaxed/simple; bh=p7Mb6vIPPHoTDt3eSFoj41xhRJUybxfZh++R1hxEinA=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=Armbw9SjtIMD+WdEl0lsLkwfJGU/1ewKxdp+62OI3LL1ZnD8zhbP03VlHCwbir5oBApAFErFO1LUD5t7jSBXCbO8Ir0Hx1PsiUrhR682ZZX1KQAI2rU3/M+Nu4dU/BU8tIaZvuzOy5Gxyczy1xIg0ftcOjH/eHbSe3OW9bdPKhk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=Vle9+LlS; arc=none smtp.client-ip=209.85.219.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) 
Date: Fri, 21 Mar 2025 23:33:38 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-11-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 10/35] perf vendor events: Update elkhartlake events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update event topic moving other topic events to cache and memory.
Signed-off-by: Ian Rogers --- .../arch/x86/elkhartlake/cache.json | 192 +++++++++ .../arch/x86/elkhartlake/memory.json | 202 +++++++++ .../arch/x86/elkhartlake/other.json | 394 ------------------ 3 files changed, 394 insertions(+), 394 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/elkhartlake/cache.json b/tools/= perf/pmu-events/arch/x86/elkhartlake/cache.json index 7882dca9d5e1..1bb42acf1d48 100644 --- a/tools/perf/pmu-events/arch/x86/elkhartlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/elkhartlake/cache.json @@ -357,6 +357,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -367,6 +377,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have an outstanding request. Returns the number of cycles unt= il the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.COREWB_M.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8003000000000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -427,6 +457,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -487,6 +527,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve an outstanding request. Returns the number of cycles until the response = is received (i.e. XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.L3_HIT", "Counter": "0,1,2,3", @@ -553,6 +614,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -613,6 +695,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have an outstanding request. Returns the number of cycles until the response is received (i.e.
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -623,6 +715,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that have any type of resp= onse.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that were supplied by the = L3 cache where a snoop was sent, the snoop hit, and modified data was forwa= rded.", "Counter": "0,1,2,3", @@ -633,6 +735,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -693,6 +805,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have an outstanding request. Returns th= e number of cycles until the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -753,6 +885,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -813,6 +955,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have an outstanding request. Returns the numb= er of cycles until the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.L1WB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -823,6 +985,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.L2WB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x2000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -843,6 +1015,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -903,6 +1085,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": 
"Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e an outstanding request. Returns the number of cycles until the response i= s received (i.e. XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that were supplied by= the L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/elkhartlake/memory.json b/tools= /perf/pmu-events/arch/x86/elkhartlake/memory.json index 34306ec24e9b..260a488540bb 100644 --- a/tools/perf/pmu-events/arch/x86/elkhartlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/elkhartlake/memory.json @@ -25,6 +25,16 @@ "SampleAfterValue": "200003", "UMask": "0x4" }, + { + "BriefDescription": "Counts all code reads that were supplied by D= RAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.ALL_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000044", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all code reads that were not supplied = by the L3 cache.", "Counter": "0,1,2,3", @@ -45,6 +55,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all code reads that were supplied by D= RAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.ALL_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000044", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -65,6 +85,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + 
"Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -85,6 +115,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -105,6 +155,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.L3_MISS", "Counter": "0,1,2,3", @@ -127,6 +198,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -147,6 +239,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores which modify a full 64 byte cacheline that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -167,6 +269,16 @@ "SampleAfterValue":
"100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -187,6 +299,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -207,6 +339,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", + "Counter": 
"0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -227,6 +379,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -317,6 +479,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -337,6 +509,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { 
"BriefDescription": "Counts streaming stores that were not supplie= d by the L3 cache.", "Counter": "0,1,2,3", @@ -357,6 +539,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.UC_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100184000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts uncached memory reads that were not su= pplied by the L3 cache.", "Counter": "0,1,2,3", @@ -377,6 +569,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.UC_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100184000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts uncached memory writes that were not s= upplied by the L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/elkhartlake/other.json b/tools/= perf/pmu-events/arch/x86/elkhartlake/other.json index 57613207f7ad..35cdbfa617e7 100644 --- a/tools/perf/pmu-events/arch/x86/elkhartlake/other.json +++ b/tools/perf/pmu-events/arch/x86/elkhartlake/other.json @@ -116,26 +116,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all code reads that were supplied by D= RAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.ALL_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000044", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all code reads that were supplied by D= RAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.ALL_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000044", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": 
"Counts all code reads that have an outstandin= g request. Returns the number of cycles until the response is received (i.e= . XQ to XQ latency).", "Counter": "0,1,2,3", @@ -146,180 +126,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.COREWB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have an outstanding request. Returns the number of cycles unt= il the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.COREWB_M.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8003000000000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - 
"BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve an outstanding request. Returns the number of cycles until the response = is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated.
Refer to new event OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have an outstanding request. Returns the number of cycles until the response is received (i.e.
XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", "Counter": "0,1,2,3", @@ -330,146 +136,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that have any type of resp= onse.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have an outstanding request. 
Returns th= e number of cycles until the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - 
"BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have an outstanding request. Returns the numb= er of cycles until the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.L1WB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.L2WB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x2000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts miscellaneous requests, such as I/O ac= cesses, that have any type of response.", "Counter": "0,1,2,3", @@ -500,46 +166,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10477", - "SampleAfterValue": "100003", - "UMask": "0x1" - 
}, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e an outstanding request. Returns the number of cycles until the response i= s received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -560,26 +186,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.UC_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100184000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.UC_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100184000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": 
"Counts uncached memory reads that have an out= standing request. Returns the number of cycles until the response is receiv= ed (i.e. XQ to XQ latency).", "Counter": "0,1,2,3", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B8F5C1D618E for ; Sat, 22 Mar 2025 06:34:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625306; cv=none; b=ReSdVJvvgBlVM13dGIS+hOvxnHQv2M97JlI3ygY70ZWOQAH6+Wg49N2H9OFcam+CvYIVvCcnVLQpXvtoQe00hhCwmKTd8xDnErAn7/73xlQgIO9xBTSAT2rw52FLzdxwbrpTm6sDXdyoq+nh22Dyz45lSNOlLfEsYXnVthmxVug= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625306; c=relaxed/simple; bh=nb+8fLuWTinOKP0A7sCwarDMUI5I2Cc/HjNA02PyftU=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=XCe39w2SdJDBUyJfNIH5dpW/7psJH7jhBYIbxw6jJU8vQEb5RaHTtsbwVhNWlthrHF75dTJbLaygUFGbGhU4aJ0mpVKFlyLHgm3EInDKAk8vWIP00cIcXnJ2FdOeWGWBdDf8FWogMALeyjsyzqWoZVt7GG6Dnr2vJSEmhaFZ4u0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=fnuQb2py; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="fnuQb2py" Received: by 
mail-yw1-f201.google.com with SMTP id 00721157ae682-6ff0787203cso31117537b3.0 for ; Fri, 21 Mar 2025 23:34:51 -0700 (PDT)
Date: Fri, 21 Mar 2025 23:33:39 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-12-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 11/35] perf vendor events: Update emeraldrapids events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update event topics, switch the metrics to being generated from the TMA spreadsheet, and apply other small cleanups.
Signed-off-by: Ian Rogers --- .../arch/x86/emeraldrapids/cache.json | 100 ++++ .../arch/x86/emeraldrapids/emr-metrics.json | 471 +++++++++--------- .../arch/x86/emeraldrapids/memory.json | 170 +++++++ .../arch/x86/emeraldrapids/other.json | 328 ------------ .../arch/x86/emeraldrapids/pipeline.json | 58 +++ 5 files changed, 563 insertions(+), 564 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json b/tool= s/perf/pmu-events/arch/x86/emeraldrapids/cache.json index 3b0581151d63..0b2b36a30075 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json @@ -569,6 +569,16 @@ "SampleAfterValue": "1000003", "UMask": "0x3" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that hit in the L3 or were snooped from another co= re's caches on the same socket.", "Counter": "0,1,2,3", @@ -609,6 +619,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit in the L3 o= r were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -689,6 +709,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) 
that have a= ny type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit in= the L3 or were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -729,6 +759,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts data load hardware prefetch requests t= o the L1 data cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.HWPF_L1D.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetches (which bring data = to L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.HWPF_L2.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10070", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetches to the L3 only tha= t have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.HWPF_L3.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x12380", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetches to the L3 only tha= t hit in the L3 or were snooped from another core's caches on the same sock= et.", "Counter": "0,1,2,3", @@ -739,6 +799,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetches to the L3 only tha= t were not supplied by the local socket's L1, L2, or L3 caches and the cach= eline was homed in a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", 
+ "EventName": "OCR.HWPF_L3.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x90002380", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts writebacks of modified cachelines and = streaming stores that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10808", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC4477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that hit in the L3 or were snooped from another core's caches on the sa= me socket.", "Counter": "0,1,2,3", @@ -779,6 +869,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F33004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop was sent= and data was returned (Modified or Not Modified).", "Counter": "0,1,2,3", diff --git 
a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json = b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json index d3b51fa6ec1c..1c4301ca9892 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json @@ -300,7 +300,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -312,7 +312,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", + "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -320,12 +320,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. 
Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -335,7 +335,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_inf= o_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -351,12 +351,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category 
represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -371,7 +371,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / 
(tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -379,7 +379,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + 
tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + 
tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependen= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_= bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma= _l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb= _full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tm= a_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_= latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_s= tore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_mem= ory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store= + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming= _stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -387,22 +387,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_c= ore_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_a= mx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization = / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utili= zation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + 
tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy= + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_c= ore_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilizat= ion + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization = / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_ope= ration)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_ms / (tma_ds= b + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { 
"BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_m= icrocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_b= ranch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes = + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_inf= o_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_oper= ation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + t= ma_fetch_bandwidth * tma_ms / (tma_dsb + tma_mite + tma_ms)) + 10 * tma_mic= rocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_bra= nch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + = tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_= thread_clks * tma_ports_utilized_0) / (tma_amx_busy + tma_divider + tma_por= ts_utilization + tma_serializing_operation) + 
tma_microcode_sequencer / (tm= a_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_mic= rocode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -410,7 +410,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound= + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_sto= re_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -418,7 +418,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= 
_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", @@ -426,7 +426,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * 
(tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -438,10 +438,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -450,7 +450,7 @@ { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tm= a_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -463,24 +463,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. 
Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup 
time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -488,8 +488,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -497,24 +497,24 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was 
stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -522,7 +522,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -530,32 +530,32 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * 
ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_i= nfo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * 
tma_info_system_core_fre= quency - 4.4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XS= NP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_= L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMA= ND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD= ))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_= MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_= info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. 
Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -566,24 +566,24 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD= _L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WIT= H_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L= 3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT= / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -592,8 +592,8 @@ "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%" }, { @@ -601,7 +601,7 @@ "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -611,7 +611,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -619,34 +619,34 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEM= ORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMOR= Y_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. 
TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEM= AND_RFO.L3_MISS\\,offcore_rsp\\=3D0x103b800002@ + 81 * tma_info_system_core= _frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricExpr": "(170 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_MISS@offcore_rsp\\=3D0x103b800002@ + 81 * tma_info_system_core_frequ= ency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, @@ -667,7 +667,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -678,7 +678,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -696,7 +696,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -705,15 +705,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -721,8 +721,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -730,8 +730,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIR= ED2.VECTOR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6= , tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -739,8 +739,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%" }, { @@ -748,8 +748,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%" }, { @@ -757,8 +757,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", "ScaleUnit": "100%" }, { @@ -769,27 +769,27 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. 
Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). 
Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%" }, { @@ -797,8 +797,8 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -809,28 +809,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", 
"MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -858,7 +858,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -866,7 +866,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -874,10 +874,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": 
"Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -938,11 +939,11 @@ "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -955,20 +956,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1005,13 +1006,13 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown 
Branch detection. See Unknown_Branches node" + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node." }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1029,7 +1030,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1037,7 +1038,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1045,7 +1046,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1053,7 +1054,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1061,7 +1062,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1069,7 +1070,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1077,7 +1078,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1132,7 +1133,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 13", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1269,7 +1270,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1334,21 +1335,21 @@ "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." }, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. 
Se= e R2C_Offcore_BW" + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches" + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1376,8 +1377,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1398,18 +1399,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" @@ -1472,14 +1473,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1490,7 +1490,7 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / 
UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" @@ -1500,11 +1500,11 @@ "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", "MetricGroup": "LockCont;MemOffcore;Server;SoC", "MetricName": "tma_info_system_mem_irq_duplicate_address", - "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + "MetricThreshold": "tma_info_system_mem_irq_duplicate_address > 0.= 1" }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" @@ -1538,7 +1538,7 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "cha_0@event\\=3D0x0@", + "MetricExpr": "uncore_cha_0@event\\=3D0x1@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, @@ -1568,7 +1568,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1577,15 +1577,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1595,13 +1594,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1617,14 +1616,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 9" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1633,7 +1632,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). 
Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%" }, { @@ -1641,8 +1640,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1650,8 +1649,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1659,8 +1658,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1668,7 +1667,7 @@ "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1677,7 +1676,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1686,7 +1685,7 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1695,7 +1694,7 @@ "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1704,17 +1703,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1722,19 +1721,19 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1751,7 +1750,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1759,39 +1758,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1800,7 +1799,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1816,19 +1815,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth 
limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1837,32 +1836,32 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1883,7 +1882,7 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & 
tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1897,17 +1896,17 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks = / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1915,11 +1914,11 @@ }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (UO= PS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. 
Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1928,7 +1927,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1936,7 +1935,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1950,19 +1949,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1971,7 +1970,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a 
result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%" }, { @@ -1980,7 +1979,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1989,7 +1988,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1998,7 +1997,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, = tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector= _256b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2006,8 +2005,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -2015,8 +2014,8 @@ "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -2024,7 +2023,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -2034,8 +2033,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%" }, { @@ -2044,32 +2043,32 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * = tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD= _L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": 
"tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2082,7 +2081,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -2091,8 +2090,8 @@ "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%" }, { @@ -2101,7 +2100,7 @@ "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, @@ -2111,7 +2110,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -2119,8 +2118,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -2128,7 +2127,7 @@ "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -2137,8 +2136,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -2146,8 +2145,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -2155,8 +2154,8 @@ "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2173,7 +2172,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -2181,31 +2180,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory 
paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache 
translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2213,7 +2212,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. 
Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2222,7 +2221,7 @@ "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -2231,8 +2230,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json b/too= ls/perf/pmu-events/arch/x86/emeraldrapids/memory.json index 41d4120d4dae..981e573330cd 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json @@ -173,6 +173,16 @@ "SampleAfterValue": "1000003", "UMask": "0x2" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the local socket's L1, L= 2, or L3 caches.", "Counter": "0,1,2,3", @@ -183,6 +193,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM on a distant memory con= troller of this socket when the system is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -193,6 +233,46 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were n= ot supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -203,6 +283,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. 
In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetches to the L3 only tha= t missed the local socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -223,6 +323,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -253,6 +363,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x733004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA 
cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that missed the local= socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -273,6 +433,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/other.json b/tool= s/perf/pmu-events/arch/x86/emeraldrapids/other.json index c424facf1b95..df4019ff7883 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/other.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/other.json @@ -7,274 +7,6 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, - { - "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb7", - "EventName": "EXE.AMX_BUSY", - "SampleAfterValue": "2000003", - "UMask": "0x2" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that 
were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM on a distant memory con= troller of this socket when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC0002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA 
Cluster(SNC) Mo= de. In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts data load hardware prefetch requests t= o the L1 data cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L1D.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches (which bring data = to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L2.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10070", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L3.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x12380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t were not supplied by the local socket's L1, L2, or L3 caches and the cach= eline was homed in a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - 
"EventName": "OCR.HWPF_L3.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x90002380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts writebacks of modified cachelines and = streaming stores that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10808", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC4477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x70C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F33004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", - "Counter": 
"0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x733004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -285,66 +17,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.WRITE_ESTIMATE.MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFBFF80822", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. 
branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= due to a resource in the back-end", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY_COUNT", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "Deprecated": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.COUNT", - "Invert": "1", - "SampleAfterValue": "100003", - "UMask": "0x7" - }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event = RS.EMPTY", - "Counter": "0,1,2,3,4,5,6,7", - "Deprecated": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.CYCLES", - "SampleAfterValue": "1000003", - "UMask": "0x7" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json b/t= ools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json index 50cacfbbc7cf..c16b63979c55 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json @@ -367,6 +367,14 @@ "SampleAfterValue": "1000003", "UMask": "0x4" }, + { + "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb7", + "EventName": "EXE.AMX_BUSY", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", "Counter": "0,1,2,3,4,5,6,7", @@ -740,6 +748,56 @@ "SampleAfterValue": "100003", "UMask": "0x2" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= due to a resource in the back-end", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY_COUNT", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "Deprecated": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.COUNT", + "Invert": "1", + "SampleAfterValue": "100003", + "UMask": "0x7" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.CYCLES", + "SampleAfterValue": "1000003", + "UMask": "0x7" + }, { "BriefDescription": "TMA slots where no uops were being issued due= to lack of back-end resources.", "Counter": "0,1,2,3,4,5,6,7", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5F44B1D89F0 for ; Sat, 22 Mar 2025 06:34:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625300; cv=none; b=MthGaNMh/2npwNa5xmK4QrGcsoB8WWwAQSXrjJV2AMxnP8T1jfGAPHDNOKpHdm60r14HaCEvjs+nIj2C6/uSwZ8BIbVBStd7/gNtECpoJE/yaMd8QEJcxiZhPsoT1sLb/+UAQ9XHb26Ux5mDSmk7KobpLYP05r/NFCdrSJVwQcg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625300; c=relaxed/simple; 
Date: Fri, 21 Mar 2025 23:33:40 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-13-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 12/35] perf vendor events: Update grandridge events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry
Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.05 to v1.07. Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/grandridge/cache.json | 150 ++++- .../arch/x86/grandridge/counter.json | 2 +- .../arch/x86/grandridge/frontend.json | 8 + .../arch/x86/grandridge/grr-metrics.json | 521 +----------------- .../pmu-events/arch/x86/grandridge/other.json | 28 - .../arch/x86/grandridge/pipeline.json | 51 +- .../arch/x86/grandridge/uncore-cache.json | 45 +- .../arch/x86/grandridge/uncore-memory.json | 338 ++++++++++++ tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 9 files changed, 582 insertions(+), 563 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/grandridge/cache.json b/tools/p= erf/pmu-events/arch/x86/grandridge/cache.json index 04802e254e51..21671c65d6dd 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/cache.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/cache.json @@ -1,4 +1,91 @@ [ + { + "BriefDescription": "Counts the number of L1D cacheline (dirty) ev= ictions caused by load misses, stores, and prefetches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x51", + "EventName": "DL1.DIRTY_EVICTION", + "PublicDescription": "Counts the number of L1D cacheline (dirty) e= victions caused by load misses, stores, and prefetches. Does not count evi= ctions or dirty writebacks caused by snoops. Does not count a replacement = unless a (dirty) line was written back.", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Exclusive state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.E", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Exclusive state. 
Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x4" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Forward state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.F", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Forward state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x10" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Modified state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.M", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Modified state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x8" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Shared state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.S", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Shared state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = evicted due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= evicted due to an L2 cache fill. 
Increments on the core that brought the l= ine in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = silently dropped due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= silently dropped due to an L2 cache fill. Increments on the core that bro= ught the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that r= esulted in a Hit from a front door request only (does not include rejects o= r recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "SampleAfterValue": "200003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of total L2 Cache Accesses = that resulted in a Miss from a front door request only (does not include re= jects or recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that m= iss the L2 and get BBL reject short and long rejects, per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.REJECTS", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. Counts on a per core basis.", "Counter": "0,1,2,3,4,5,6,7", @@ -35,7 +122,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an icache or itlb miss which hit in the LLC.", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an ICACHE or ITLB miss which hit in the LLC. 
If the= core has access to an L3 cache, an LLC hit refers to an L3 cache hit, othe= rwise it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x35", "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_HIT", @@ -43,7 +130,7 @@ "UMask": "0x6" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an icache or itlb miss which missed all the caches.= ", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an ICACHE or ITLB miss which missed all the caches.= If the core has access to an L3 cache, an LLC miss refers to an L3 cache m= iss, otherwise it is an L2 cache miss.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x35", "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_MISS", @@ -68,7 +155,7 @@ "UMask": "0x1" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC.", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC. If the cor= e has access to an L3 cache, an LLC hit refers to an L3 cache hit, otherwis= e it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x34", "EventName": "MEM_BOUND_STALLS_LOAD.LLC_HIT", @@ -76,13 +163,21 @@ "UMask": "0x6" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s.", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s. 
If the core has access to an L3 cache, an LLC miss refers to an L3 cache= miss, otherwise it is an L2 cache miss.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x34", "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", "SampleAfterValue": "1000003", "UMask": "0x78" }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled to a store buffer full condition", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.SBFULL", + "SampleAfterValue": "1000003", + "UMask": "0x80" + }, { "BriefDescription": "Counts the number of load ops retired that mi= ss the L3 cache and hit in DRAM", "Counter": "0,1,2,3,4,5,6,7", @@ -335,6 +430,33 @@ "SampleAfterValue": "200003", "UMask": "0x42" }, + { + "BriefDescription": "Counts the number of memory uops retired that= missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the second Level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11" + }, + { + "BriefDescription": "Counts the number of store uops retired that = miss in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12" + }, { "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", "Counter": "0,1,2,3,4,5,6,7", @@ -344,6 +466,16 @@ "SampleAfterValue": "1000003", "UMask": "0x6" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", 
+ "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, and modified data was= forwarded.", "Counter": "0,1,2,3,4,5,6,7", @@ -364,6 +496,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, and modified data was= forwarded.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/counter.json b/tools= /perf/pmu-events/arch/x86/grandridge/counter.json index 9fd5d8ad6d3b..d9ac3aca5bd5 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/counter.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/counter.json @@ -37,6 +37,6 @@ { "Unit": "CHACMS", "CountersNumFixed": "0", - "CountersNumGeneric": 4 + "CountersNumGeneric": "4" } ] \ No newline at end of file diff --git a/tools/perf/pmu-events/arch/x86/grandridge/frontend.json b/tool= s/perf/pmu-events/arch/x86/grandridge/frontend.json index 7cdf611efb23..fef5cba533bb 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/frontend.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/frontend.json @@ -31,5 +31,13 @@ "EventName": "ICACHE.MISSES", "SampleAfterValue": "200003", "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-se= quencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + 
"EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "1000003", + "UMask": "0x4" } ] diff --git a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json b/t= ools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json index 2f9959c61718..3029022e4e94 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json @@ -69,7 +69,7 @@ }, { "BriefDescription": "Percentage of time spent in the active CPU po= wer state C0", - "MetricExpr": "tma_info_system_cpu_utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricName": "cpu_utilization", "ScaleUnit": "100%" }, @@ -213,525 +213,6 @@ "MetricName": "stores_retired_per_instr", "ScaleUnit": "1per_instr" }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", - "MetricExpr": "tma_core_bound", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_allocation_restriction", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", - "MetricExpr": "TOPDOWN_BE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "Slots;TopdownL1;tma_L1_group", - "MetricName": "tma_backend_bound", - "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. 
If a uop is= not available (IQ is empty), this event will not count", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.ALL_P / (6 * CPU_CLK_UNHALT= ED.CORE)", - "MetricGroup": "Slots;TopdownL1;tma_L1_group", - "MetricName": "tma_bad_speculation", - "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", - "MetricName": "tma_branch_detect", - "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", - "MetricName": "tma_branch_mispredicts", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch", - "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", - "MetricName": "tma_branch_resteer", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS)", - "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (6 * CPU_CLK_UNHALTED.CORE)= ", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", - "MetricName": "tma_cisc", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", - "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", - "MetricName": "tma_core_bound", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls", - "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": 
"Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", - "MetricName": "tma_decode", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (6 * CPU_CLK_UNH= ALTED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", - "MetricName": "tma_fast_nuke", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls", - "MetricExpr": "TOPDOWN_FE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "Slots;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "MetricgroupNoGroup": "TopdownL1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses", - "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations", - "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", - "MetricName": "tma_ifetch_bandwidth", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations", - 
"MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (6 * CPU_CLK_UN= HALTED.CORE)", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", - "MetricName": "tma_ifetch_latency", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", - "MetricExpr": "INST_RETIRED.ANY / FP_FLOPS_RETIRED.ALL", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipflop" - }, - { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", - "MetricExpr": "INST_RETIRED.ANY / (FP_INST_RETIRED.128B_DP + FP_IN= ST_RETIRED.128B_SP)", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128" - }, - { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", - "MetricExpr": "INST_RETIRED.ANY / FP_INST_RETIRED.64B_DP", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp" - }, - { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", - "MetricExpr": "INST_RETIRED.ANY / FP_INST_RETIRED.32B_SP", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "100 * (LD_HEAD.DTLB_MISS_AT_RET + LD_HEAD.PGWALK_AT= _RET) / CPU_CLK_UNHALTED.CORE", - "MetricGroup": "Cycles", - "MetricName": "tma_info_bottleneck_dtlb_miss_bound_cycles", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.ALL / CPU_CLK_UNHALTE= D.CORE", - "MetricGroup": "Cycles;Ifetch", - "MetricName": "tma_info_bottleneck_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= 
ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.ALL / CPU_CLK_UNHALTED.= CORE", - "MetricGroup": "Cycles;Load_Store_Miss", - "MetricName": "tma_info_bottleneck_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "100 * LD_HEAD.ANY_AT_RET / CPU_CLK_UNHALTED.CORE", - "MetricGroup": "Cycles;Mem_Exec", - "MetricName": "tma_info_bottleneck_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricName": "tma_info_br_inst_mix_ipbranch" - }, - { - "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricName": "tma_info_br_inst_mix_ipcall" - }, - { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricName": "tma_info_br_inst_mix_ipfarbranch" - }, - { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", - "MetricExpr": "INST_RETIRED.ANY / (BR_MISP_RETIRED.COND - BR_MISP_= 
RETIRED.COND_TAKEN)", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken" - }, - { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken" - }, - { - "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", - "MetricName": "tma_info_br_inst_mix_ipmisp_indirect" - }, - { - "BriefDescription": "Instructions per retired return Branch Mispre= diction", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RETURN", - "MetricName": "tma_info_br_inst_mix_ipmisp_ret" - }, - { - "BriefDescription": "Instructions per retired Branch Misprediction= ", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricName": "tma_info_br_inst_mix_ipmispredict" - }, - { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.ALL_= BRANCHES", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o" - }, - { - "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.LD_BUF / CPU_CLK_UNHALTED= .CORE", - "MetricName": "tma_info_buffer_stalls_load_buffer_stall_cycles", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.RSV / CPU_CLK_UNHALTED.CO= RE", - "MetricName": "tma_info_buffer_stalls_mem_rsv_stall_cycles", - "ScaleUnit": "100%" - }, - { - 
"BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.ST_BUF / CPU_CLK_UNHALTED= .CORE", - "MetricName": "tma_info_buffer_stalls_store_buffer_stall_cycles", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Cycles Per Instruction", - "MetricExpr": "CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY", - "MetricName": "tma_info_core_cpi", - "ScaleUnit": "1per_instr" - }, - { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "FP_FLOPS_RETIRED.ALL / CPU_CLK_UNHALTED.CORE", - "MetricGroup": "Flops", - "MetricName": "tma_info_core_flopc" - }, - { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "INST_RETIRED.ANY / CPU_CLK_UNHALTED.CORE", - "MetricName": "tma_info_core_ipc" - }, - { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "TOPDOWN_RETIRING.ALL_P / INST_RETIRED.ANY", - "MetricName": "tma_info_core_upi" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.L2_HIT / MEM_BOUND_ST= ALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l2h= it", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.LLC_HIT / MEM_BOUND_S= TALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.L2_HIT / MEM_BOUND_STAL= LS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_loadmissbound_with_l2hit", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 
miss that hit the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.LLC_HIT / MEM_BOUND_STA= LLS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_loadmissbound_with_l3hit", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", - "MetricExpr": "100 * LD_HEAD.L1_BOUND_AT_RET / CPU_CLK_UNHALTED.CO= RE", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_l1_bound" - }, - { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", - "MetricExpr": "100 * (LD_HEAD.L1_BOUND_AT_RET + MEM_BOUND_STALLS_L= OAD.ALL) / CPU_CLK_UNHALTED.CORE", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_load_bound" - }, - { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", - "MetricExpr": "100 * (MEM_SCHEDULER_BLOCK.ST_BUF / MEM_SCHEDULER_B= LOCK.ALL) * tma_mem_scheduler", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_store_bound" - }, - { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", - "MetricExpr": "1e3 * MACHINE_CLEARS.FP_ASSIST / INST_RETIRED.ANY", - "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki" - }, - { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", - "MetricExpr": "1e3 * MACHINE_CLEARS.PAGE_FAULT / INST_RETIRED.ANY", - "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki" - }, - { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to self-modifying code", - "MetricExpr": "1e3 * MACHINE_CLEARS.SMC / INST_RETIRED.ANY", - "MetricName": 
"tma_info_machine_clear_bound_machine_clears_smc_pki" - }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "100 * LD_BLOCKS.ADDRESS_ALIAS / MEM_UOPS_RETIRED.AL= L_LOADS", - "MetricName": "tma_info_mem_exec_blocks_loads_with_adressaliasing", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "100 * LD_BLOCKS.DATA_UNKNOWN / MEM_UOPS_RETIRED.ALL= _LOADS", - "MetricName": "tma_info_mem_exec_blocks_loads_with_storefwdblk", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "100 * LD_HEAD.L1_MISS_AT_RET / LD_HEAD.ANY_AT_RET", - "MetricName": "tma_info_mem_exec_bound_loadhead_with_l1miss", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "100 * LD_HEAD.OTHER_AT_RET / LD_HEAD.ANY_AT_RET", - "MetricName": "tma_info_mem_exec_bound_loadhead_with_otherpipeline= blks", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "100 * LD_HEAD.PGWALK_AT_RET / LD_HEAD.ANY_AT_RET", - "MetricName": "tma_info_mem_exec_bound_loadhead_with_pagewalk", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "100 * LD_HEAD.DTLB_MISS_AT_RET / LD_HEAD.ANY_AT_RET= ", - "MetricName": "tma_info_mem_exec_bound_loadhead_with_stlbhit", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "100 * LD_HEAD.ST_ADDR_AT_RET / LD_HEAD.ANY_AT_RET", - "MetricName": "tma_info_mem_exec_bound_loadhead_with_storefwding", - "ScaleUnit": "100%" - }, - { - 
"BriefDescription": "Instructions per Load", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS", - "MetricName": "tma_info_mem_mix_ipload" - }, - { - "BriefDescription": "Instructions per Store", - "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES", - "MetricName": "tma_info_mem_mix_ipstore" - }, - { - "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", - "MetricExpr": "100 * MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRE= D.ALL_LOADS", - "MetricName": "tma_info_mem_mix_load_locks_ratio" - }, - { - "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", - "MetricExpr": "100 * MEM_UOPS_RETIRED.SPLIT_LOADS / MEM_UOPS_RETIR= ED.ALL_LOADS", - "MetricName": "tma_info_mem_mix_load_splits_ratio" - }, - { - "BriefDescription": "Ratio of mem load uops to all uops", - "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / TOPDOWN_RETIRING= .ALL_P", - "MetricName": "tma_info_mem_mix_memload_ratio" - }, - { - "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "100 * SERIALIZATION.C01_MS_SCB / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricName": "tma_info_serialization_tpause_cycles", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "tma_info_system_cpu_utilization" - }, - { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "FP_FLOPS_RETIRED.ALL / (duration_time * 1e9)", - "MetricGroup": "Flops", - "MetricName": "tma_info_system_gflops", - "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" - }, - { - "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.CORE_P:k / CPU_CLK_UNHALTED.CORE", - "MetricName": "tma_info_system_kernel_utilization" - }, - { - "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", - "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", - "MetricName": "tma_info_system_mux" - }, - { - "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", - "MetricName": "tma_info_system_turbo_utilization" - }, - { - "BriefDescription": "Percentage of all uops which are FPDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.FPDIV / TOPDOWN_RETIRING.ALL_P", - "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio" - }, - { - "BriefDescription": "Percentage of all uops which are IDiv uops", - "MetricExpr": "100 * UOPS_RETIRED.IDIV / TOPDOWN_RETIRING.ALL_P", - "MetricName": "tma_info_uop_mix_idiv_uop_ratio" - }, - { - "BriefDescription": "Percentage of all uops which are microcode op= s", - "MetricExpr": "100 * UOPS_RETIRED.MS / TOPDOWN_RETIRING.ALL_P", - "MetricName": "tma_info_uop_mix_microcode_uop_ratio" - }, - { - "BriefDescription": "Percentage of all uops which are x87 uops", - "MetricExpr": "100 * UOPS_RETIRED.X87 / TOPDOWN_RETIRING.ALL_P", - "MetricName": "tma_info_uop_mix_x87_uop_ratio" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses", - "MetricExpr": "TOPDOWN_FE_BOUND.ITLB_MISS / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because 
allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (6 * CPU_C= LK_UNHALTED.CORE)", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", - "MetricName": "tma_machine_clears", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", - "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", - "MetricName": "tma_mem_scheduler", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", - "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", - "MetricName": "tma_non_mem_scheduler", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", - "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (6 * CPU_CLK_UNHALTE= D.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", - "MetricName": "tma_nuke", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized", - "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", - 
"MetricName": "tma_other_fb", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes", - "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", - "MetricName": "tma_predecode", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", - "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (6 * CPU_CLK_UNHALTED.C= ORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", - "MetricName": "tma_register", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", - "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", - "MetricName": "tma_reorder_buffer", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", - "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", - "MetricName": "tma_resource_bound", - "MetricgroupNoGroup": "TopdownL2", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that result = in retirement slots", - "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "Slots;TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "MetricgroupNoGroup": "TopdownL1", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit 
(JEU), or microcode sequencer (MS)", - "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", - "MetricName": "tma_serialization", - "ScaleUnit": "100%" - }, { "BriefDescription": "Uncore operating frequency in GHz", "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTIC= KS) * #num_packages) / 1e9 / duration_time", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/other.json b/tools/p= erf/pmu-events/arch/x86/grandridge/other.json index 28f9a4c3ea84..daa16030d493 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/other.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/other.json @@ -8,26 +8,6 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3,4,5,6,7", @@ -37,13 +17,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4" } ] 
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json index 40fa4f5ae261..a934b64f66d0 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json @@ -56,6 +56,14 @@ "SampleAfterValue": "200003", "UMask": "0xfb" }, + { + "BriefDescription": "Counts the number of near indirect JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef" + }, { "BriefDescription": "This event is deprecated. Refer to new event BR_INST_RETIRED.INDIRECT_CALL", "Counter": "0,1,2,3,4,5,6,7", @@ -81,6 +89,30 @@ "SampleAfterValue": "200003", "UMask": "0xf7" }, + { + "BriefDescription": "Counts the number of near taken branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xc0" + }, + { + "BriefDescription": "Counts the number of near relative CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd" + }, + { + "BriefDescription": "Counts the number of near relative JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_JMP", + "SampleAfterValue": "200003", + "UMask": "0xdf" + }, { "BriefDescription": "Counts the total number of mispredicted branch instructions retired for all branch types.", "Counter": "0,1,2,3,4,5,6,7", @@ -121,6 +153,14 @@ "SampleAfterValue": "200003", "UMask": "0xfb" }, + { + "BriefDescription": "Counts the number of mispredicted near indirect JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_JMP", +
"SampleAfterValue": "200003", + "UMask": "0xef" + }, { "BriefDescription": "Counts the number of mispredicted near taken branch instructions retired.", "Counter": "0,1,2,3,4,5,6,7", @@ -236,8 +276,9 @@ "UMask": "0x20" }, { - "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.", + "BriefDescription": "This event is deprecated.", "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", "EventCode": "0xc3", "EventName": "MACHINE_CLEARS.SLOW", "SampleAfterValue": "20003", @@ -259,6 +300,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the number of issue slots that were not consumed by the backend because allocation is stalled due to a mispredicted jump or a machine clear.
[This event is alias to TOPDOWN_BAD_SPECULATION= .ALL_P]", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json b/= tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json index 6a80cf6cbd36..b89ab6e5cfb5 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json @@ -8,6 +8,16 @@ "PortMask": "0x000", "Unit": "CHACMS" }, + { + "BriefDescription": "Counts the number of cycles FAST trigger is r= eceived from the global FAST distress wire.", + "Counter": "0,1,2,3", + "EventCode": "0x34", + "EventName": "UNC_CHACMS_RING_SRC_THRTL", + "Experimental": "1", + "PerPkg": "1", + "PortMask": "0x000", + "Unit": "CHACMS" + }, { "BriefDescription": "Number of CHA clock cycles while the event is= enabled", "Counter": "0,1,2,3", @@ -530,6 +540,26 @@ "UMask": "0x4", "Unit": "CHA" }, + { + "BriefDescription": "Ingress (from CMS) Allocations : IRQ : Counts= number of allocations per cycle into the specified Ingress queue.", + "Counter": "0,1,2,3", + "EventCode": "0x13", + "EventName": "UNC_CHA_RxC_INSERTS.IRQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "CHA" + }, + { + "BriefDescription": "Ingress (from CMS) Occupancy : IRQ : Counts n= umber of entries in the specified Ingress queue in each cycle.", + "Counter": "0", + "EventCode": "0x11", + "EventName": "UNC_CHA_RxC_OCCUPANCY.IRQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "CHA" + }, { "BriefDescription": "All TOR Inserts", "Counter": "0,1,2,3", @@ -603,7 +633,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt prefetch from local IA that mis= s the cache", + "BriefDescription": "Data read opt prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT_PREF", @@ -764,7 +794,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Last level cache prefetch read for ownership = from local IA that miss 
the cache", + "BriefDescription": "Last level cache prefetch read for ownership = from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_LLCPREFRFO", @@ -859,7 +889,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt_Prefs issued by iA Cor= es that missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt prefetch from lo= cal iA that missed the LLC targeting local memory", "UMask": "0xc8a6fe01", "Unit": "CHA" }, @@ -934,7 +964,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_LOCAL", @@ -954,7 +984,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_LOCAL", @@ -1024,7 +1054,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO", @@ -1034,7 +1064,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO_PREF", @@ -1406,7 +1436,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_DRD_OPT", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opts issued by iA Cores", "UMask": 
"0xc827ff01", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json index e75b3050ccd5..6a11e5505957 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json @@ -188,6 +188,256 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": 
"UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": 
"UNC_M_POWERDOWN_CYCLES.SCH1_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x10", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x20", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x40", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x80", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e and all pages are closed", + "Counter": "0,1,2,3", + "EventCode": "0x88", + "EventName": "UNC_M_POWER_CHANNEL_PPD_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM and throttle level is zero.", + "Counter": "0,1,2,3", + "EventCode": "0x89", + "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM and throttle level is zero.", + "Counter": "0,1,2,3", + "EventCode": "0x89", + "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": 
"UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "MR4 temp reading is throttling", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.MR4BLKEN", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "RAPL is throttling", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.RAPLBLK", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number = of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -360,6 +610,94 @@ "PerPkg": "1", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8e", + "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8e", + "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at High level on speci= fied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8d", + "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at High level on speci= fied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8d", + "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Normal level on spe= cified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8b", + "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Normal level on spe= cified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8b", + "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Mid level on specif= ied DIMM", + 
"Counter": "0,1,2,3", + "EventCode": "0x8c", + "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Mid level on specif= ied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8c", + "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 1b592cf63940..ed7a1845d43d 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -12,7 +12,7 @@ GenuineIntel-6-9[6C],v1.05,elkhartlake,core GenuineIntel-6-CF,v1.11,emeraldrapids,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core -GenuineIntel-6-B6,v1.05,grandridge,core +GenuineIntel-6-B6,v1.07,grandridge,core GenuineIntel-6-A[DE],v1.06,graniterapids,core GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v29,haswellx,core --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AFD8E1D95A3 for ; Sat, 22 Mar 2025 06:34:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625314; cv=none; b=b+dFRVjf1sunwLsoIPgygriY1K1IycabbYWF/mfM3T4HxayjPEXPuxyLBJmL11Hca1IXGfu4jmvt437JQn2Q4byMU6v7TtX17N5vuluAA6e3EmqFixiKLJyWeh+MCyBJr4c8IgVTQc+gCTqIo1P5vyTOuJ5nPPpNgG7Js56nSFQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; 
Date: Fri, 21 Mar 2025 23:33:41 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-14-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 13/35] perf vendor events: Add graniterapids retirement latencies From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add retirement latencies for use in place of retirement latency events. Update events from v1.06 to v1.08. Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- .../arch/x86/graniterapids/cache.json | 122 +++++ .../arch/x86/graniterapids/counter.json | 5 + .../arch/x86/graniterapids/frontend.json | 21 + .../arch/x86/graniterapids/gnr-metrics.json | 483 +++++++++--------- .../arch/x86/graniterapids/memory.json | 130 +++++ .../arch/x86/graniterapids/other.json | 237 --------- .../arch/x86/graniterapids/pipeline.json | 52 ++ .../arch/x86/graniterapids/uncore-cache.json | 42 ++ .../graniterapids/uncore-interconnect.json | 90 +++- .../arch/x86/graniterapids/uncore-memory.json | 240 +++++++++ tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 11 files changed, 926 insertions(+), 498 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json b/tool= s/perf/pmu-events/arch/x86/graniterapids/cache.json index d155da8610d8..6fa985656312 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json @@ -351,6 +351,9 @@ "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", "PublicDescription": "Counts retired load instructions with locked= access.", + "RetirementLatencyMax": 5156, + "RetirementLatencyMean": 63.76, + "RetirementLatencyMin": 15, "SampleAfterValue": "100007", "UMask": "0x21" }, @@ -361,6 +364,9 @@ "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", + "RetirementLatencyMax": 4704, + "RetirementLatencyMean": 3.97, + "RetirementLatencyMin": 0, "SampleAfterValue": "100003", "UMask": "0x41" }, @@ -371,6 +377,9 @@ "EventCode": "0xd0", "EventName": 
"MEM_INST_RETIRED.SPLIT_STORES", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", + "RetirementLatencyMax": 65535, + "RetirementLatencyMean": 19.0, + "RetirementLatencyMin": 0, "SampleAfterValue": "100003", "UMask": "0x42" }, @@ -381,6 +390,9 @@ "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", + "RetirementLatencyMax": 3424, + "RetirementLatencyMean": 1.57, + "RetirementLatencyMin": 0, "SampleAfterValue": "100003", "UMask": "0x9" }, @@ -391,6 +403,9 @@ "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", + "RetirementLatencyMax": 65535, + "RetirementLatencyMean": 5.24, + "RetirementLatencyMin": 0, "SampleAfterValue": "100003", "UMask": "0xa" }, @@ -430,6 +445,9 @@ "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", + "RetirementLatencyMax": 4472, + "RetirementLatencyMean": 353.04, + "RetirementLatencyMin": 0, "SampleAfterValue": "20011", "UMask": "0x4" }, @@ -440,6 +458,9 @@ "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "RetirementLatencyMax": 830, + "RetirementLatencyMean": 125.27, + "RetirementLatencyMin": 0, "SampleAfterValue": "20011", "UMask": "0x1" }, @@ -460,6 +481,9 @@ "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", + "RetirementLatencyMax": 3939, + "RetirementLatencyMean": 289.9, + "RetirementLatencyMin": 0, 
"SampleAfterValue": "20011", "UMask": "0x2" }, @@ -470,6 +494,9 @@ "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", + "RetirementLatencyMax": 4146, + "RetirementLatencyMean": 115.83, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -479,6 +506,9 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", + "RetirementLatencyMax": 3572, + "RetirementLatencyMean": 430.22, + "RetirementLatencyMin": 0, "SampleAfterValue": "1000003", "UMask": "0x2" }, @@ -489,6 +519,9 @@ "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", + "RetirementLatencyMax": 8552, + "RetirementLatencyMean": 125.36, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x8" }, @@ -498,6 +531,9 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", + "RetirementLatencyMax": 2580, + "RetirementLatencyMean": 135.29, + "RetirementLatencyMin": 0, "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -548,6 +584,9 @@ "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", + "RetirementLatencyMax": 7140, + "RetirementLatencyMean": 5.71, + "RetirementLatencyMin": 0, "SampleAfterValue": "200003", "UMask": "0x2" }, @@ -568,6 +607,9 @@ "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", + "RetirementLatencyMax": 5630, + "RetirementLatencyMean": 57.64, + "RetirementLatencyMin": 0, "SampleAfterValue": "100021", "UMask": "0x4" }, @@ -598,6 +640,16 @@ "SampleAfterValue": "1000003", "UMask": "0x3" }, + { + "BriefDescription": "Counts 
demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that hit in the L3 or were snooped from another co= re's caches on the same socket.", "Counter": "0,1,2,3", @@ -618,6 +670,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit in the L3 o= r were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -678,6 +740,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that hit a modified = line in a distant L3 Cache or were snooped from a distant core's L1/L2 cach= es on this socket when the system is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.SNC_CACHE.HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1008000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that either hit a no= n-modified line in a distant L3 Cache or were snooped from a distant core's= L1/L2 caches on this socket when the system is in SNC (sub-NUMA cluster) m= ode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.SNC_CACHE.HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x808000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + 
}, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit in= the L3 or were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -698,6 +790,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts writebacks of modified cachelines and = streaming stores that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10808", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC4477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that hit in the L3 or were snooped from another core's caches on the sa= me socket.", "Counter": "0,1,2,3", @@ -718,6 +830,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere 
supplied by a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F33004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop was sent= and data was returned (Modified or Not Modified).", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json b/to= ols/perf/pmu-events/arch/x86/graniterapids/counter.json index 137da7efa8b1..5d3b202eadd3 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/counter.json @@ -73,5 +73,10 @@ "Unit": "MDF", "CountersNumFixed": "0", "CountersNumGeneric": "4" + }, + { + "Unit": "UBOX", + "CountersNumFixed": "0", + "CountersNumGeneric": "2" } ] \ No newline at end of file diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json b/t= ools/perf/pmu-events/arch/x86/graniterapids/frontend.json index dc81055941b1..77ebb46d104e 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json @@ -53,6 +53,9 @@ "MSRIndex": "0x3F7", "MSRValue": "0x1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. 
the decoded instruction-cache) miss.", + "RetirementLatencyMax": 65535, + "RetirementLatencyMean": 2.46, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -75,6 +78,9 @@ "MSRIndex": "0x3F7", "MSRValue": "0x14", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", + "RetirementLatencyMax": 980, + "RetirementLatencyMean": 41.96, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -86,6 +92,9 @@ "MSRIndex": "0x3F7", "MSRValue": "0x12", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", + "RetirementLatencyMax": 1785, + "RetirementLatencyMean": 9.83, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -97,6 +106,9 @@ "MSRIndex": "0x3F7", "MSRValue": "0x13", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", + "RetirementLatencyMax": 2854, + "RetirementLatencyMean": 137.41, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -250,6 +262,9 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", + "RetirementLatencyMax": 65535, + "RetirementLatencyMean": 77.14, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -261,6 +276,9 @@ "MSRIndex": "0x3F7", "MSRValue": "0x15", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", + "RetirementLatencyMax": 754, + "RetirementLatencyMean": 206.85, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -271,6 +289,9 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", + "RetirementLatencyMax": 532, + "RetirementLatencyMean": 3.85, + "RetirementLatencyMin": 0, "SampleAfterValue": "100007", "UMask": "0x3" }, diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json = 
b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json index a345b6874606..9d64d7f5f222 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json @@ -310,7 +310,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -322,7 +322,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", + "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -330,12 +330,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. 
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -345,7 +345,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_inf= o_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -361,12 +361,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. 
This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -381,7 +381,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / 
(tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -389,7 +389,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + 
tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= 
latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependen= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_= bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma= _l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb= _full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tm= a_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_= latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_s= tore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_mem= ory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store= + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming= _stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -397,22 +397,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_c= ore_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_a= mx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization = / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utili= zation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * 
(tma_core_bound * tma_divider / (tma_amx_busy= + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_c= ore_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilizat= ion + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization = / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_ope= ration)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_ms / (tma_ds= b + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { 
"BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_m= icrocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_b= ranch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes = + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_inf= o_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_oper= ation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + t= ma_fetch_bandwidth * tma_ms / (tma_dsb + tma_mite + tma_ms)) + 10 * tma_mic= rocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_bra= nch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + = tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_= thread_clks * tma_ports_utilized_0) / (tma_amx_busy + tma_divider + tma_por= ts_utilization + tma_serializing_operation) + 
tma_microcode_sequencer / (tm= a_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_mic= rocode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -420,7 +420,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound= * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dra= m_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_fa= lse_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + = tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound= * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false= _sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -428,7 +428,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = 
(tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", @@ -436,7 +436,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / 
(tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -448,10 +448,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -460,7 +460,7 @@ { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tm= a_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -473,24 +473,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. 
Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup 
time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -498,8 +498,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -507,24 +507,24 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * FRONTEND_RETIRED= .L1I_MISS:R / tma_info_thread_clks - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + 
"BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "FRONTEND_RETIRED.L2_MISS * FRONTEND_RETIRED.L2_MISS= :R / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -532,7 +532,7 @@ "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * FRONTEND_RETIRE= D.ITLB_MISS:R / tma_info_thread_clks - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -540,48 +540,48 @@ "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * FRONTEND_RETIRED.STLB_= MISS:R / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = 
to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.= WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "BriefDescription": "This metric represents fraction of cycles 
the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches.", "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * BR_MISP_RETIRED.= COND_NTAKEN_COST:R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_nt_mispredicts", - "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches.", "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * BR_MISP_RETIRED.C= OND_TAKEN_COST:R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_mispredicts", - "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * MEM_LOAD_= L3_HIT_RETIRED.XSNP_MISS:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (79 * tma_i= nfo_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < M= EM_LOAD_L3_HIT_RETIRED.XSNP_MISS:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS *= (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequen= cy) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_= FWD:R, 
MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_core_freque= ncy) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED= .XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_c= ore_frequency) - 4.4 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_R= D.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DA= TA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOA= D_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * min(MEM_LOAD_L= 3_HIT_RETIRED.XSNP_MISS:R, 74.6 * tma_info_system_core_frequency) + MEM_LOA= D_L3_HIT_RETIRED.XSNP_FWD * min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R, 76.6 * = tma_info_system_core_frequency) * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (= OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_= WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -592,24 +592,24 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * MEM_LOA= D_L3_HIT_RETIRED.XSNP_NO_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (79 *= tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if= 0 < MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSN= P_NO_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_co= re_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RET= IRED.XSNP_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_info_system_c= ore_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_H= IT_RETIRED.XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_inf= o_system_core_frequency) - 4.4 * tma_info_system_core_frequency) * (1 - OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM += OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB= _HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * min(MEM_LOAD= _L3_HIT_RETIRED.XSNP_NO_FWD:R, 74.6 * tma_info_system_core_frequency) + MEM= _LOAD_L3_HIT_RETIRED.XSNP_FWD * min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R, 74.= 6 * tma_info_system_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_= HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SN= OOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MI= SS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", 
"MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -618,7 +618,7 @@ "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%" }, @@ -627,7 +627,7 @@ "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -637,7 +637,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -645,34 +645,34 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * MEM_INST_RET= IRED.STLB_HIT_LOADS:R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < MEM_INST= _RETIRED.STLB_HIT_LOADS:R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_i= nfo_thread_clks + tma_load_stlb_miss", + "MetricExpr": "MEM_INST_RETIRED.STLB_HIT_LOADS * min(MEM_INST_RETI= RED.STLB_HIT_LOADS:R, 7) / tma_info_thread_clks + tma_load_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * MEM_INST_RE= TIRED.STLB_HIT_STORES:R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < MEM_I= NST_RETIRED.STLB_HIT_STORES:R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / = tma_info_thread_clks + tma_store_stlb_miss", + "MetricExpr": "MEM_INST_RETIRED.STLB_HIT_STORES * min(MEM_INST_RET= IRED.STLB_HIT_STORES:R, 7) / tma_info_thread_clks + tma_store_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. 
Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricExpr": "(170 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_MISS@offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
"PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, @@ -693,7 +693,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2.
Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -704,7 +704,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -722,7 +722,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired).
Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -731,15 +731,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -747,8 +747,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.
Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -756,8 +756,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.
Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -765,8 +765,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -774,8 +774,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -783,8 +783,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -795,27 +795,27 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part.
Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions", "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions.
{([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]).
Sample with: UOPS_RETIRED.HEAVY", "ScaleUnit": "100%" }, { @@ -823,24 +823,24 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect CALL instructions", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect CALL instructions.", "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRED.INDIRECT_CALL_COST:R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", "MetricName": "tma_ind_call_mispredicts", - "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect JMP instructions", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by indirect JMP instructions.",
"MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * BR_MISP_RETIRED.INDIRECT_COST:R - BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRED.INDIRECT_CALL_COST:R) / tma_info_thread_clks, 0)", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group", "MetricName": "tma_ind_jump_mispredicts", - "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -851,28 +851,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means
higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -900,7 +900,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -908,7 +908,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses +
tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -916,10 +916,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -980,11 +981,11 @@ "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)" + "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
}, { "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -997,8 +998,8 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@", + "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, @@ -1011,7 +1012,7 @@ }, { "BriefDescription": "Average number of Uops issued by front-end when it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, @@ -1061,10 +1062,10 @@ }, { "BriefDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection. See Unknown_Branches node" + "PublicDescription": "Average number of cycles the front-end was delayed due to an Unknown Branch detection. See Unknown_Branches node." }, { "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to retired branches who got branch address clears", @@ -1073,7 +1074,7 @@ "MetricName": "tma_info_frontend_unknown_branches_ret" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1091,7 +1092,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)", @@ -1099,7 +1100,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate).
Values < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)", @@ -1107,7 +1108,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)", @@ -1115,7 +1116,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
}, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1123,7 +1124,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1131,7 +1132,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1139,7 +1140,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1194,7 +1195,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 13", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1331,7 +1332,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1396,21 +1397,21 @@ "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." 
}, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW" + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches" + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." 
}, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1452,8 +1453,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1474,18 +1475,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" @@ -1548,14 +1549,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1566,14 +1566,14 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / 
UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" @@ -1599,7 +1599,7 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "cha_0@event\\=3D0x0@", + "MetricExpr": "uncore_cha_0@event\\=3D0x1@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, @@ -1629,7 +1629,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1638,15 +1638,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": 
"tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1656,13 +1655,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1678,14 +1677,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 9" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", 
"ScaleUnit": "100%" }, { @@ -1694,7 +1693,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%" }, { @@ -1702,8 +1701,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1711,8 +1710,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1720,8 +1719,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1729,7 +1728,7 @@ "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1738,7 +1737,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1747,16 +1746,16 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", - "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * MEM_LOAD_RETIRED.L2_= HIT:R, MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_info_system_core_frequency)) if= 0 < MEM_LOAD_RETIRED.L2_HIT:R else MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_in= fo_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * min(MEM_LOAD_RETIRED.L2_H= IT:R, 4.4 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT = / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1765,17 +1764,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. 
L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * MEM_LOAD_RETIRED.L3_= HIT:R, MEM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.= 4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_RETIRED.L3_HIT:R else M= EM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.4 * tma_= info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIR= ED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "MEM_LOAD_RETIRED.L3_HIT * min(MEM_LOAD_RETIRED.L3_H= IT:R, 32.6 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT= / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. 
Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1783,19 +1782,19 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1812,7 +1811,7 @@ "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1820,31 +1819,31 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1852,7 +1851,7 @@ "MetricExpr": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM:R * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1861,7 +1860,7 @@ "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * MEM_INST_RETIRED.LOCK= _LOADS:R / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. 
Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1877,19 +1876,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & 
tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1898,32 +1897,32 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1944,7 +1943,7 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & 
tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1958,17 +1957,17 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks = / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1976,10 +1975,10 @@ }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (UO= PS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1989,7 +1988,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1997,7 +1996,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -2011,19 +2010,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to 
Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -2032,7 +2031,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%" }, { @@ -2041,7 +2040,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2050,7 +2049,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2059,7 +2058,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2067,8 +2066,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -2076,8 +2075,8 @@ "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALL= S.SCOREBOARD, 0) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -2085,7 +2084,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -2095,8 +2094,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%" }, { @@ -2105,7 +2104,7 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, @@ -2114,8 +2113,8 @@ "MetricExpr": "(MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM:R + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * MEM_LOA= D_L3_MISS_RETIRED.REMOTE_FWD:R) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_R= ETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -2123,22 +2122,22 @@ "MetricExpr": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM:R * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. 
Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s.", "MetricExpr": "BR_MISP_RETIRED.RET_COST * BR_MISP_RETIRED.RET_COST= :R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ret_mispredicts", - "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & (tma_branch_mispr= edicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2151,7 +2150,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents 
fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -2160,8 +2159,8 @@ "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%" }, { @@ -2170,26 +2169,26 @@ "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * MEM_INST_RETIRE= D.SPLIT_LOADS:R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_r= eal_latency) if 0 < MEM_INST_RETIRED.SPLIT_LOADS:R else MEM_INST_RETIRED.SP= LIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_LOADS * min(MEM_INST_RETIRED= .SPLIT_LOADS:R, tma_info_memory_load_miss_real_latency) / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * MEM_INST_RETIR= ED.SPLIT_STORES:R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < MEM_INST_RETIRED.S= PLIT_STORES:R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES * min(MEM_INST_RETIRE= D.SPLIT_STORES:R, 1) / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. 
Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -2197,7 +2196,7 @@ "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -2206,8 +2205,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -2215,8 +2214,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -2224,8 +2223,8 @@ "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2242,7 +2241,7 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -2250,31 +2249,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2282,7 +2281,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2291,7 +2290,7 @@ "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. 
These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -2300,8 +2299,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json b/too= ls/perf/pmu-events/arch/x86/graniterapids/memory.json index 5da5a10275ba..deddfb4686e1 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json @@ -194,6 +194,16 @@ "SampleAfterValue": "1000003", "UMask": "0x2" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the local socket's L1, L= 2, or L3 caches.", "Counter": "0,1,2,3", @@ -204,6 +214,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -214,6 +244,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for 
ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were n= ot supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -224,6 +284,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -254,6 +334,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x733004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": 
"OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/other.json b/tool= s/perf/pmu-events/arch/x86/graniterapids/other.json index 8df37f303273..99fcdc341272 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/other.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/other.json @@ -16,204 +16,6 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, - { - "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb7", - "EventName": "EXE.AMX_BUSY", - "SampleAfterValue": "2000003", - "UMask": "0x2" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC0002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. 
In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts writebacks of modified cachelines and = streaming stores that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10808", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC4477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x70C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F33004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", - "Counter": 
"0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x733004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -224,45 +26,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.WRITE_ESTIMATE.MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFBFF80822", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. 
branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7" - }, - { - "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json b/t= ools/perf/pmu-events/arch/x86/graniterapids/pipeline.json index da6478607984..8530d93849fa 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json @@ -154,6 +154,9 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", + "RetirementLatencyMax": 888, + "RetirementLatencyMean": 6.11, + "RetirementLatencyMin": 0, "SampleAfterValue": "400009", "UMask": "0x50" }, @@ -171,6 +174,9 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", + "RetirementLatencyMax": 2750, + "RetirementLatencyMean": 5.09, + "RetirementLatencyMin": 0, "SampleAfterValue": "400009", "UMask": "0x41" }, @@ -197,6 +203,9 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", + "RetirementLatencyMax": 703, + "RetirementLatencyMean": 15.56, + 
"RetirementLatencyMin": 0, "SampleAfterValue": "400009", "UMask": "0x42" }, @@ -205,6 +214,9 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_COST", + "RetirementLatencyMax": 1562, + "RetirementLatencyMean": 11.07, + "RetirementLatencyMin": 0, "SampleAfterValue": "100003", "UMask": "0xc0" }, @@ -239,6 +251,9 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET_COST", + "RetirementLatencyMax": 1082, + "RetirementLatencyMean": 32.37, + "RetirementLatencyMin": 9, "SampleAfterValue": "100007", "UMask": "0x48" }, @@ -401,6 +416,14 @@ "SampleAfterValue": "1000003", "UMask": "0x4" }, + { + "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb7", + "EventName": "EXE.AMX_BUSY", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", "Counter": "0,1,2,3,4,5,6,7", @@ -774,6 +797,35 @@ "SampleAfterValue": "100003", "UMask": "0x2" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json= b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json index 53055986534d..b782f6d54fc2 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json @@ -853,6 +853,16 @@ "UMask": "0x8", "Unit": "CHA" }, + { + "BriefDescription": "Ingress (from CMS) Allocations : IRQ : Counts= number of allocations per cycle into the specified Ingress queue.", + "Counter": "0,1,2,3", + "EventCode": "0x13", + "EventName": "UNC_CHA_RxC_INSERTS.IRQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "CHA" + }, { "BriefDescription": "Ingress (from CMS) Occupancy : IRQ : Counts n= umber of entries in the specified Ingress queue in each cycle.", "Counter": "0", @@ -863,6 +873,38 @@ "UMask": "0x1", "Unit": "CHA" }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking exclusive lines in the core's cache. Snoop filter capacity e= victions occur when the snoop filter is full and evicts an existing entry t= o track a new entry. 
Does not count clean evictions such as when a core's c= ache replaces a tracked cacheline with a new cacheline.", + "Counter": "0,1,2,3", + "EventCode": "0x3d", + "EventName": "UNC_CHA_SF_EVICTION.E_STATE", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Snoop Filter Capacity Evictions : E state", + "UMask": "0x2", + "Unit": "CHA" + }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking modified lines in the core's cache. Snoop filter capacity ev= ictions occur when the snoop filter is full and evicts an existing entry to= track a new entry. Does not count clean evictions such as when a core's ca= che replaces a tracked cacheline with a new cacheline.", + "Counter": "0,1,2,3", + "EventCode": "0x3d", + "EventName": "UNC_CHA_SF_EVICTION.M_STATE", + "PerPkg": "1", + "PublicDescription": "Snoop Filter Capacity Evictions : M state", + "UMask": "0x1", + "Unit": "CHA" + }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking shared lines in the core's cache. Snoop filter capacity evic= tions occur when the snoop filter is full and evicts an existing entry to t= rack a new entry. 
Does not count clean evictions such as when a core's cach= e replaces a tracked cacheline with a new cacheline.", + "Counter": "0,1,2,3", + "EventCode": "0x3d", + "EventName": "UNC_CHA_SF_EVICTION.S_STATE", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Snoop Filter Capacity Evictions : S state", + "UMask": "0x4", + "Unit": "CHA" + }, { "BriefDescription": "All TOR Inserts", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconne= ct.json b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.= json index 5c50275c79b0..e5bd11b27bcd 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json @@ -1076,7 +1076,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress bypasses for for AD_BNC", + "BriefDescription": "Egress bypasses for AD_BNC", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.AD_BNC", @@ -1086,7 +1086,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress bypasses for for AD_CRD", + "BriefDescription": "Egress bypasses for AD_CRD", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.AD_CRD", @@ -1096,7 +1096,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress bypasses for for AK", + "BriefDescription": "Egress bypasses for AK", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.AK", @@ -1106,7 +1106,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress bypasses for for BL_BNC", + "BriefDescription": "Egress bypasses for BL_BNC", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.BL_BNC", @@ -1116,7 +1116,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress bypasses for for BL_CRD", + "BriefDescription": "Egress bypasses for BL_CRD", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.BL_CRD", @@ -1126,7 +1126,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress 
bypasses for for IV", + "BriefDescription": "Egress bypasses for IV", "Counter": "0,1,2,3", "EventCode": "0x1E", "EventName": "UNC_MDF_TxR_BYPASS.IV", @@ -1136,7 +1136,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for AD_BNC", + "BriefDescription": "Number of egress inserts for AD_BNC", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.AD_BNC", @@ -1146,7 +1146,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for AD_CRD", + "BriefDescription": "Number of egress inserts for AD_CRD", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.AD_CRD", @@ -1156,7 +1156,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for AK", + "BriefDescription": "Number of egress inserts for AK", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.AK", @@ -1166,7 +1166,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for BL_BNC", + "BriefDescription": "Number of egress inserts for BL_BNC", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.BL_BNC", @@ -1176,7 +1176,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for BL_CRD", + "BriefDescription": "Number of egress inserts for BL_CRD", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.BL_CRD", @@ -1186,7 +1186,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Number of egress inserts for for IV", + "BriefDescription": "Number of egress inserts for IV", "Counter": "0,1,2,3", "EventCode": "0x1C", "EventName": "UNC_MDF_TxR_INSERTS.IV", @@ -1196,7 +1196,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for AD_BNC", + "BriefDescription": "Egress occupancy for AD_BNC", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.AD_BNC", @@ -1206,7 +1206,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for AD_CRD", + 
"BriefDescription": "Egress occupancy for AD_CRD", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.AD_CRD", @@ -1216,7 +1216,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for AK", + "BriefDescription": "Egress occupancy for AK", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.AK", @@ -1226,7 +1226,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for BL_BNC", + "BriefDescription": "Egress occupancy for BL_BNC", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.BL_BNC", @@ -1236,7 +1236,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for BL_CRD", + "BriefDescription": "Egress occupancy for BL_CRD", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.BL_CRD", @@ -1246,7 +1246,7 @@ "Unit": "MDF" }, { - "BriefDescription": "Egress occupancy for for IV", + "BriefDescription": "Egress occupancy for IV", "Counter": "0,1,2,3", "EventCode": "0x1D", "EventName": "UNC_MDF_TxR_OCCUPANCY.IV", @@ -1932,5 +1932,59 @@ "Experimental": "1", "PerPkg": "1", "Unit": "UPI" + }, + { + "BriefDescription": "Message Received : Doorbell", + "Counter": "0,1", + "EventCode": "0x42", + "EventName": "UNC_U_EVENT_MSG.DOORBELL_RCVD", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x8", + "Unit": "UBOX" + }, + { + "BriefDescription": "Message Received : Interrupt", + "Counter": "0,1", + "EventCode": "0x42", + "EventName": "UNC_U_EVENT_MSG.INT_PRIO", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Message Received : Interrupt : Interrupts", + "UMask": "0x10", + "Unit": "UBOX" + }, + { + "BriefDescription": "Message Received : IPI", + "Counter": "0,1", + "EventCode": "0x42", + "EventName": "UNC_U_EVENT_MSG.IPI_RCVD", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Message Received : IPI : Inter Processor Int= errupts", + "UMask": "0x4", + "Unit": "UBOX" + }, + { + "BriefDescription": 
"Message Received : MSI", + "Counter": "0,1", + "EventCode": "0x42", + "EventName": "UNC_U_EVENT_MSG.MSI_RCVD", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Message Received : MSI : Message Signaled In= terrupts - interrupts sent by devices (including PCIe via IOxAPIC) (Socket = Mode only)", + "UMask": "0x2", + "Unit": "UBOX" + }, + { + "BriefDescription": "Message Received : VLW", + "Counter": "0,1", + "EventCode": "0x42", + "EventName": "UNC_U_EVENT_MSG.VLW_RCVD", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Message Received : VLW : Virtual Logical Wir= e (legacy) message were received from Uncore.", + "UMask": "0x1", + "Unit": "UBOX" } ] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.jso= n b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json index 5f4783ff6ce5..b991f6e1afbe 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json @@ -188,6 +188,94 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 temp 
readings forced 2x refre= sh", + "Counter": "0,1,2,3", + "EventCode": "0xA7", + "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles MR4 MRRs was triggered/running", + "Counter": "0,1,2,3", + "EventCode": "0xA6", + "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, { "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", "Counter": "0,1,2,3", @@ -286,6 +374,70 @@ "PublicDescription": "-", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM and throttle level is zero.", + "Counter": "0,1,2,3", + "EventCode": "0x89", + "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM and throttle level is zero.", + "Counter": "0,1,2,3", 
+ "EventCode": "0x89", + "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "MR4 temp reading is throttling", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.MR4BLKEN", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "RAPL is throttling", + "Counter": "0,1,2,3", + "EventCode": "0x46", + "EventName": "UNC_M_POWER_THROTTLE_CYCLES.RAPLBLK", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number = of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -478,6 +630,94 @@ "UMask": "0x1", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8e", + "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Critical level on s= pecified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8e", + "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at High level on speci= fied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8d", + "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at High level on speci= fied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8d", + "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Normal level on spe= cified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8b", + "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Normal level on spe= cified DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8b", + "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Mid level on specif= ied DIMM", + 
"Counter": "0,1,2,3", + "EventCode": "0x8c", + "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles Throttling at Mid level on specif= ied DIMM", + "Counter": "0,1,2,3", + "EventCode": "0x8c", + "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index ed7a1845d43d..579b4fbd65d6 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -13,7 +13,7 @@ GenuineIntel-6-CF,v1.11,emeraldrapids,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-B6,v1.07,grandridge,core -GenuineIntel-6-A[DE],v1.06,graniterapids,core +GenuineIntel-6-A[DE],v1.08,graniterapids,core GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v29,haswellx,core GenuineIntel-6-7[DE],v1.24,icelake,core --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 51F9C1DC046 for ; Sat, 22 Mar 2025 06:34:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625306; cv=none; b=bKlrCzArDL3k+ba4FNnaARXqt+cGRbfv77vIyq1he2g0TxR3r+pcIvJznJMZgyLfWnUn6qpSXqrXvCofPFnYMXxM3e6wU3AxJ5czBaTWmsuBNZDuoHdro8icQGevgOZ+PPkphTRhXa/cYbgdPLJxaAku2U/jcw6wkqqUX4S3JJA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1742625306; c=relaxed/simple; bh=KMrcEmWjay1vuPIKNWGmAQ7ellmM1MUzgyJMlgx3QDo=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=QSw6F21j5/ISfbKDPLZIvDQL2V2LvaH6heJd9H0wqANQpOoEp8pLSEQtYk7i+6AJWG01hf84vs0k/Z1rcK10+vUhqzDjVl+g0Z87gW3FIOJNZwK5G0j2rjYPmrg4wiUU4PeWopK+Mvz1yEZaz2shSij+b4r7tc7DM8Oc4dTOGEg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=AaXFJEAI; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="AaXFJEAI" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6fda4eaca22so25432867b3.2 for ; Fri, 21 Mar 2025 23:34:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625298; x=1743230098; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=/xU0wBAsl4jlEkSKnJ2JM7pI7tPOhIu8nmWNVPVFxSY=; b=AaXFJEAICEKoEruUWkSIp1a0AsBvE4guv9QBHCj0bhh1tlY1xOey5fU9RnM6V4m5T7 De3hbxNfKK8GT95hpT/pbD3bmKX4zHkDHqj0tKQmRxKXRg/n4V6CR/VpYz0HoXSSozPE XR2Fa5PMEBLg7d8i7OgPsluNDHWx/406a3U69ZA/Xz268LpuhYiJyVKW2AnXhKtO7xAj l23jqhpWgBRg1h7Axjo8dRcsjPociKXgNZsg3z24Pn8UtoCzme2LvfRi7RQzmLuTkRsG T3PkiAzeGSHKtWMkicifrYbumAj3MFyokIVIxFRjr+Y8+O+9Dg16wD8459iDW+PV8r1L TMbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625298; x=1743230098; h=content-transfer-encoding:to:from:subject:references:mime-version 
:message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=/xU0wBAsl4jlEkSKnJ2JM7pI7tPOhIu8nmWNVPVFxSY=; b=rLgRsNWdXhhDXuAk+olnNFM5WDpk3EJAsVMHXE06VGvJLAA4u6YnqHIbNHZaonFLnz lG8Mn8ltpdrEK0/+m554thNVL8KoXaRqH2zfN8ZP3qCArf6pNqZcMOp1btOTF+W5dcJx LGFJ7z1Z8vmUtrDb4jxYbfx2kz7Cn5JM4clIM5Cn+DTMiMDdyOmRpGp8BbT/TORIUJjn aB86qyX45IkjMpii9XvjFB4xUAoUQY/3mTi8sxdJj/qmKQCuVEjZ1Bf1+3gxWoPr5MGi MwhgxPp5e74TOB36lUdfCDO8XEUeWhp9lLnkNvYlOqC0075kSbbVY0AOb9AaY7/QvXto EL9w== X-Forwarded-Encrypted: i=1; AJvYcCWOZ7czlj6XioDyMRpHXkEZ19+VPTsoqeEfY5Q+DiUempYbuXcUhU/uW0f3wdfAm2ijlcncAULMuErVvJk=@vger.kernel.org X-Gm-Message-State: AOJu0YzkWULp7L0tWx/MUa39YLhlyPeomPQA/ndii9FO36zPCLH+p1Gz IRg1ZpHYNjPrMo6oAuL1T93IFzVxr/NXw6XUC1YgxKrYAa5noWzBqB6+fiV/6YTQJs3FbqCBJKH y4lW0qw== X-Google-Smtp-Source: AGHT+IGsFOtHoau+2RzUCDtLJLzZqmnakD8PydqB2k0GQSXmogInHxWHEB6a/nq4x5k6lxrI+Dif445iqq5S X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:690c:5a8f:b0:6fe:afd0:2083 with SMTP id 00721157ae682-700bacc4c78mr78357b3.3.1742625298179; Fri, 21 Mar 2025 23:34:58 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:42 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-15-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 14/35] perf vendor events: Update haswell metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , 
Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification. Signed-off-by: Ian Rogers --- .../arch/x86/haswell/hsw-metrics.json | 206 +++++++++--------- 1 file changed, 102 insertions(+), 104 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json b/tool= s/perf/pmu-events/arch/x86/haswell/hsw-metrics.json index 0c1040b7e38c..b26ea70a3628 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . 
However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -92,8 +92,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, { @@ -104,7 +104,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -114,7 +114,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { @@ -125,7 +125,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. 
These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -133,8 +133,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -143,8 +143,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -153,8 +153,8 @@ "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. 
Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= ", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { @@ -165,7 +165,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { @@ -174,8 +174,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. 
Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -183,8 +183,8 @@ "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -193,8 +193,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_PENDING / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { @@ -203,7 +203,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -211,7 +211,7 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, @@ -220,8 +220,8 @@ "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.W= ALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. 
This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", "ScaleUnit": "100%" }, { @@ -229,8 +229,8 @@ "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES= .WALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. 
Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -238,18 +238,18 @@ "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_= CORE / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE.= Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_cle= ars", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -279,33 +279,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -316,7 +316,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= 
U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -328,7 +328,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@= ) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D= 0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@))", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@) if= #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ /= 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@))", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -353,7 +353,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -398,7 +398,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -502,8 +502,8 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -537,14 +537,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -592,7 +591,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -601,8 +600,7 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": 
"tma_info_thread_cpi" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -628,14 +626,14 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURAT= ION) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, @@ -644,8 +642,8 @@ "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. 
Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports= _utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -653,8 +651,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY= .STALLS_L2_PENDING) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { @@ -663,8 +661,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_P= ENDING / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { @@ -673,8 +671,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", "ScaleUnit": "100%" }, { @@ -682,18 +680,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -712,8 +710,8 @@ "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -724,15 +722,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, = tma_microcode_sequencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x6@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -741,19 +739,19 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS= _LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_A= CTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cp= u@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_C= LK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.C= ORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info= _thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENT= S.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * t= ma_backend_bound", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, 
CYCLE_ACTIVITY.STALL= S_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_= ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UO= PS_EXECUTED.CORE\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_U= NHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\= \,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_= ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CY= CLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend= _bound", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { @@ -762,7 +760,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_ma= chine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -771,7 +769,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). 
This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { @@ -779,8 +777,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. 
Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -789,7 +787,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -798,7 +796,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -834,7 +832,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. 
Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -843,7 +841,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -858,46 +856,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, 
CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -917,7 +915,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -925,8 +923,8 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -934,7 +932,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -943,8 +941,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -952,8 +950,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -962,8 +960,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog
From nobody Thu Dec 18 13:40:49 2025 Date: Fri, 21 Mar 2025 23:33:43 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-16-irogers@google.com> X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 15/35] perf vendor events: Update haswellx metrics From: Ian Rogers To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification.
Signed-off-by: Ian Rogers --- .../arch/x86/haswellx/hsx-metrics.json | 222 +++++++++--------- 1 file changed, 110 insertions(+), 112 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json b/too= ls/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json index 1a05b74be575..8245a98ad4b9 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json @@ -276,12 +276,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -294,8 +294,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. 
Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, { @@ -306,7 +306,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. 
Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -316,7 +316,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { @@ -327,7 +327,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. 
These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -335,8 +335,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -345,8 +345,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -355,8 +355,8 @@ "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_D= RAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RET= IRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + ME= M_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS= _RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_= HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_U= OPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM = + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED= .REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. 
Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= , tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { @@ -367,7 +367,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { @@ -376,8 +376,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRA= M + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIR= ED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. 
Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -385,8 +385,8 @@ "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -395,8 +395,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_PENDING / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", "ScaleUnit": "100%" }, { @@ -405,7 +405,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -413,7 +413,7 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, @@ -422,8 +422,8 @@ "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.W= ALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. 
This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", "ScaleUnit": "100%" }, { @@ -431,8 +431,8 @@ "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES= .WALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. 
Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -440,18 +440,18 @@ "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_= HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info= _thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_= RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LL= C_MISS.REMOTE_HITM. Related metrics: tma_contested_accesses, tma_data_shari= ng, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -481,33 +481,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -518,7 +518,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= 
U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -530,7 +530,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@= ) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D= 0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@))", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@) if= #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ /= 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@))", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -555,7 +555,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -600,7 +600,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -704,8 +704,8 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -739,14 +739,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -757,14 +756,14 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", + "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" @@ -814,7 +813,7 @@ "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -823,8 +822,7 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -850,14 +848,14 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + 
ITLB_MISSES.WALK_DURAT= ION) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, @@ -866,8 +864,8 @@ "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports= _utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -875,8 +873,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY= .STALLS_L2_PENDING) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { @@ -885,8 +883,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_P= ENDING / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { @@ -895,8 +893,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", "ScaleUnit": "100%" }, { @@ -904,18 +902,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -933,8 +931,8 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { @@ -943,8 +941,8 @@ "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -955,15 +953,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, = tma_microcode_sequencer, tma_ms_switches, tma_remote_cache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x6@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -972,19 +970,19 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS= _LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_A= CTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cp= u@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_C= LK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.C= ORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info= _thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENT= S.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * t= ma_backend_bound", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALL= S_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_= ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UO= PS_EXECUTED.CORE\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_U= NHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\= \,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_= ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CY= CLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend= _bound", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": 
"TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { @@ -993,7 +991,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_ma= chine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. 
The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1002,7 +1000,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", "ScaleUnit": "100%" }, { @@ -1010,8 +1008,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1020,7 +1018,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). 
Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1029,7 +1027,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1065,7 +1063,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. 
Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1074,7 +1072,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1089,46 +1087,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, 
CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1137,8 +1135,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + 
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1146,8 +1144,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { @@ -1167,7 +1165,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1175,8 +1173,8 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1184,7 +1182,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1193,8 +1191,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1202,8 +1200,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1212,8 +1210,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9BC8D1DEFE1 for ; Sat, 22 Mar 2025 06:35:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625318; cv=none; b=TmRyfTqluY1i6j7JFHHs0QMDk9Ri+k4SiYElV17/HNThPAm1eAVVI8MPfY00udLb17tLCqFGGt7DAomja+v0Y3uH6UvtaMXaWwSHEQZqsu7KPIRXKZlD2n/aFUjfi/e/qD7qofwxKNiAbLxoatLy67c63TiEWCO8qpF9YFtmSMc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625318; c=relaxed/simple; bh=v/ReQe5prlLbIsdt0PORCPuX7vbeMk0uz2PnbUyHtrw=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=g6k5veB0NRaCyWk0w+bJLPh4G+Ha8+FI6bCRizn51D7/m4/u36rqmKu5ml/kF00o4Qe2MJq4RwX3z6WaMmtXvfgCGgL42ITltyt4A9Sri2vLFUGMepy1XuOh4vSMC09h9jKjbmZiAqP0rLjAy65NG7V08hggMANXbuBfU3iyTyY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass 
(2048-bit key) header.d=google.com header.i=@google.com header.b=XB5gv5NI; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="XB5gv5NI" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e60aebf48e8so3163870276.0 for ; Fri, 21 Mar 2025 23:35:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625303; x=1743230103; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=asqUmkMdWggbcu2w7sAqFbNHL77vzOg63lQ9FcTOxkI=; b=XB5gv5NIVloyyv6p4kYjQeBpBnrPx1aKsgCL2lRVovFc+aYcvzRk7xX+/cfeZ7Susr mPsV3FgzmE7IdcRCWYv6H8OvI7Ze4cfHXRTljEZRm01lVAcGxMlJhBOgfqYPa59Ql2CM OhftjBX11vi8qHEoLdIe6rORoiUbTuCsm7gKgZ/SpV43sBeeqL8fQZcM4P52m9vagghI ioPID+KxxsOXCfW/aQaL/DQkoRWvRrK3QBiZO3icBh2H9DD/cD6vgGqB1e0Kl7wiDK0J 4asgqvLHKx/hipz3eLo4TxTZbXTnlHw/M/o4OBmovNqIyIkqhcBnjqo2gTmEllsPQx/Q xlzw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625303; x=1743230103; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=asqUmkMdWggbcu2w7sAqFbNHL77vzOg63lQ9FcTOxkI=; b=TUOWIwTkDZ7gk6+yJJYaKI1L9L4MtOVZXPGNNDp0Ey19P4IrUlmeGee6eEqdzsNhm/ Izw6acwlK2Z7ML0b7TpqmBzw5wt+UM40jNkH+Ev6qt3E+dTalNv9eleuHB1uzwcLw9fT itrJoI4xWFhG6WyszrmXYB21hw1eUpC7WBV7GT3hTJvW9mfuU3xiV45HSu6NZDhn0+kY s5Yf88Wmuaz2zn/gzAGZZbDRWOvpFEXMFIIVa8O8mWG7jnBMOJ9+FepRPAMUmUYgGqgk zFz1IpHGApR8hftieu+BHL9A8l2YLoTVQEFTstkTs1swwO4iyXr9/pY9NlfCCqJSXX7y FWjg== X-Forwarded-Encrypted: i=1; 
AJvYcCVh3mhmhlT+E0+XaLyru6vbyh6jzffRgZnPKLZzFm6pX6KpwHgI4/HNu2+gf8oZWgciEAWr9R9gWky78+k=@vger.kernel.org X-Gm-Message-State: AOJu0Yxvw7O8i5plIGXxvIFlqaboJ1EerutYnrdeUSiVJzGd6XnZOpnx JWtS3p1RD77bgFn4+L4jPghyKPCWQIuhzOM0e0IdMk+msT0fZjM11fz8MXWsZBx3Qz6sN4tM1NJ k9gakew== X-Google-Smtp-Source: AGHT+IGOC0HcynX9LxKXPhk+Sz/sCd/CFK2HBomgaljYtP+pOIENj20Xa3VjaqU/OeCwI4mQVh5xqURXhNdb X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a25:d887:0:b0:e63:ef1a:f7d8 with SMTP id 3f1490d57ef6-e66a4f73b8amr3755276.5.1742625303351; Fri, 21 Mar 2025 23:35:03 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:44 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-17-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 16/35] perf vendor events: Update icelake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. 
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/icelake/cache.json | 60 +++ .../arch/x86/icelake/icl-metrics.json | 385 +++++++++--------- .../pmu-events/arch/x86/icelake/memory.json | 160 ++++++++ .../pmu-events/arch/x86/icelake/other.json | 220 ---------- 4 files changed, 412 insertions(+), 413 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/icelake/cache.json b/tools/perf= /pmu-events/arch/x86/icelake/cache.json index 015f70f157d1..e7bb2ca6f183 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/cache.json +++ b/tools/perf/pmu-events/arch/x86/icelake/cache.json @@ -445,6 +445,16 @@ "SampleAfterValue": "50021", "UMask": "0x20" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that hit a cacheline in the L3 where a snoop was s= ent or not.", "Counter": "0,1,2,3", @@ -505,6 +515,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit a cacheline= in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -565,6 +585,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": 
"OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit a = cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -625,6 +655,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that hit a cacheline in the L3 where a= snoop was sent or not.", "Counter": "0,1,2,3", @@ -655,6 +695,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that hit a cacheline in the L3 where a snoop was sent or n= ot.", "Counter": "0,1,2,3", @@ -715,6 +765,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware 
prefetch RFOs (which bring da= ta to L2) that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tool= s/perf/pmu-events/arch/x86/icelake/icl-metrics.json index 63e28a03dc60..c5bfdb2f288b 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,12 +129,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -149,7 +149,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= 
l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -157,7 +157,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + 
tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= 
oads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_stor= e_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tm= a_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing = + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -165,22 +165,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_br= anches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tm= a_itlb_misses + tma_lcp + tma_ms_switches) 
+ tma_fetch_bandwidth * tma_ms /= (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_clears_resteers + tma_= mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_= dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + 
tma_ms_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -188,7 +188,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / 
(tma_dtlb_store + tma_false_sharing + tma_spli= t_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -196,15 +196,15 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -216,17 +216,17 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", "MetricThreshold": "tma_bottleneck_useful_work > 20" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -248,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -257,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -266,24 +266,24 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric 
estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -291,7 +291,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -299,33 +299,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": 
"tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((32.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (27 * tm= a_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * 
MEM_= LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(29 * tma_info_system_core_frequency * MEM_LOAD_L3_= HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L= 1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. 
Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -335,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -362,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -372,7 +372,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -382,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -390,26 +390,26 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. 
TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -417,8 +417,8 @@ "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { @@ -437,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -447,7 +447,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -465,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -474,15 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -490,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU 
has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -499,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -508,7 +508,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -517,7 +517,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -526,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -538,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -556,8 +556,8 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -569,28 +569,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", 
"MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -619,7 +619,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -628,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -637,10 +637,11 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + 
tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -701,11 +702,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -718,20 +719,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -773,7 +774,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -791,7 +792,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": 
"Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -799,7 +800,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -807,7 +808,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -815,7 +816,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -823,7 +824,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -831,7 +832,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -886,7 +887,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 11", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1005,7 +1006,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1067,8 +1068,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1095,12 +1096,12 @@ "MetricExpr": "INST_RETIRED.ANY / 
ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1141,14 +1142,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1175,7 +1175,7 @@ "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. 
This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1183,7 +1183,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1191,7 +1191,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." 
}, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1219,7 +1219,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1228,15 +1228,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1246,13 +1245,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1268,14 +1267,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 7.5" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1283,8 +1282,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents 
fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1292,7 +1291,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1301,7 +1300,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1311,7 +1310,7 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1320,7 +1319,7 @@ "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1330,17 +1329,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETI= RED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) = / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1348,18 +1347,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1376,7 +1375,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1384,31 +1383,31 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1417,7 +1416,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1427,7 +1426,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%" }, { @@ -1437,15 +1436,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1454,7 +1453,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1465,11 +1464,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1491,7 +1490,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1506,24 +1505,24 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. 
Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D1@ / tma_info_core_core= _clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1534,7 +1533,7 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1543,7 +1542,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1558,19 +1557,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1614,8 +1613,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= 
CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1623,8 +1622,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1632,7 +1631,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1641,7 +1640,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1650,14 +1649,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1670,7 +1669,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1679,7 +1678,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1689,7 +1688,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1698,8 +1697,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1707,7 +1706,7 @@ "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1716,8 +1715,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. 
This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1726,8 +1725,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1735,8 +1734,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1753,7 +1752,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -1761,31 +1760,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1793,7 +1792,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1802,7 +1801,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. 
These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1811,8 +1810,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/icelake/memory.json b/tools/per= f/pmu-events/arch/x86/icelake/memory.json index abaf3f4f9d63..1455aaac37b1 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/memory.json +++ b/tools/perf/pmu-events/arch/x86/icelake/memory.json @@ -176,6 +176,16 @@ "SampleAfterValue": "50021", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -186,6 +196,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that DRAM supplied t= he request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that was not supplie= d by the L3 cache.", "Counter": "0,1,2,3", @@ -196,6 +226,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that DRAM supplied t= he request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": 
"0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that DRAM s= upplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that was no= t supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -206,6 +256,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that DRAM s= upplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that was not supplied by the L3 cache.= ", "Counter": "0,1,2,3", @@ -216,6 +286,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", 
+ "MSRValue": "0x184000400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -226,6 +316,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -236,6 +346,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that DRAM supplied the request.", + "Counter": "0,1,2,3", + 
"EventCode": "0xB7, 0xBB", + "EventName": "OCR.OTHER.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184008000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -246,6 +376,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.OTHER.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184008000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts streaming stores that DRAM supplied th= e request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.STREAMING_WR.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000800", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that was not supplied= by the L3 cache.", "Counter": "0,1,2,3", @@ -256,6 +406,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts streaming stores that DRAM supplied th= e request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.STREAMING_WR.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000800", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/icelake/other.json b/tools/perf= /pmu-events/arch/x86/icelake/other.json index a96b2a989d3f..141cd30a30af 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/other.json +++ b/tools/perf/pmu-events/arch/x86/icelake/other.json @@ -26,186 +26,6 @@ "SampleAfterValue": "200003", "UMask": "0x20" }, - { - "BriefDescription": "Counts demand 
instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that DRAM supplied t= he request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that DRAM supplied t= he request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive 
ownership (PREFETCHW) that have a= ny type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that DRAM s= upplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that DRAM s= upplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", - 
"MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which br= ing data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring da= ta to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", - "MSRIndex": 
"0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that have any type of response.", "Counter": "0,1,2,3", @@ -216,26 +36,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.OTHER.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184008000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.OTHER.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184008000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -245,25 +45,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts streaming stores that DRAM supplied th= e request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.STREAMING_WR.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000800", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts streaming stores that DRAM supplied th= e request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.STREAMING_WR.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000800", - "SampleAfterValue": "100003", - "UMask": "0x1" } ] --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No 
client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4DBE71B6CE3 for ; Sat, 22 Mar 2025 06:35:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625325; cv=none; b=MMRQuMPSd2XtEgzbu4OQY8QSTYJFMWA8xnti67KmPVSekq7YqPh11O2r/fm8aJU+l53Ji+p2k9jJ/Y5D+ZidYIdU1GidWAI472ty8b/l3iSO42+oDd1tF/aYvSO2JKHyFCHdkBN1K7weP+ORaQMhCNtT4HE0tQtb1vpAVQMlmPs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625325; c=relaxed/simple; bh=asqU3Djvb2tHXwieDpCQ5yTIEnY4hYJNiGyxwIRPdLM=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=VRRRrIcw4PHdzz9Obp5gO8YfKnodRRoYsfS32pAfUrRj4p/ArbMrt0QcXrXeDElDEVt4PqowU8G/EJNBSW6fJvWLLhoJsL4AZy9JF4LH+iaySHrq2OVBunmXelW3wqt5n09EUIt4G7W6HZx62hvjMwPTIUvhsc/obJXV5WQARx0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=4oxXMUzq; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="4oxXMUzq" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6fedcc61536so52226247b3.0 for ; Fri, 21 Mar 2025 23:35:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625306; x=1743230106; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; 
bh=2UpMjr96wUzX8tbHdnpAMHgCTlp+9vrgLW8XT5ILgPs=; b=4oxXMUzquIjtqS4ti99dg1Oy3ytR7IbC+9K+nKmEKfpovnnmSDLmjhB3vNec87XFMz C9AKgR7ZP9S61Zx6na29cwWF0iEI1q+1AUg95d5zjR9ZikPNM//OxNtFYavavGLrXQZP bqMrR+C7yZfly7R3tiMe5dozUc658siXdmg8eH/Rf17ixJMyUSo0G9q7hxBAWaRgr1Zi wCdD/AEqHm5RKy+uwrM1fbEs0Z8j9m90bbMa/wHpBY4lMSLyN0UOezqqCtSZqO8q2mRb CduD1bXEPj5yeWLxWMSwrJ+JXI/8xrv5tKgTGngCRA3M93egQgvdVqsTMDHFrgmP4boF HGhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625306; x=1743230106; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=2UpMjr96wUzX8tbHdnpAMHgCTlp+9vrgLW8XT5ILgPs=; b=i7GXtX2TBUf0Z8bjJDjjZfh+8ThJU3vAZFPqcZe8vQrwb8tMnw4B5iQHw9q3B5AAcH L9C1RQJoNit7ugaht3RnTpR/uT/CnYp1NqXvwYZVpa5+YjrDb08NLs4y1QnrxgAZZoGl Djzo2peiUCx2j3mqj6OguMw2F3MN69dVGVS9wPctZ/Y4fp1Mndv0ziFB3WQGeaLgMXC4 n+sDdlJo83pPNMqIrj3q0q/LUrsqHl5kTXkfNQQ8dZxcUzZdgyhybGesQ8/y+o2azgoQ h0r3nRxDFZTi4uw0vpwg2fXCsr1o1ZuOCcVbNR8Hh1zcFOS2I3OZ2zrX3TpvcWUNT/wS oNYA== X-Forwarded-Encrypted: i=1; AJvYcCV6yz4zwBzG3lPpWA0SVhKVyS0LRlzL7cffArV1SouQYbEN+al6yRjyLuZWHNnBnH/LA5D1/HS1JUW1nkg=@vger.kernel.org X-Gm-Message-State: AOJu0YyqMjD1mXVElQTHj5Y6Z1bG81So3/Ej/gBEUCUyveS+VCzq3mZn BL+L5a+05u4255vL0Qp8wXUiDkrFhNZmR3oC52XBrp+C5Jx5HHBm07/iSDTZ55FezSbNOA8+Ra6 Nwnikog== X-Google-Smtp-Source: AGHT+IHu0hhp5OtGPl84djwfpY9Vp+rRbZJdBTcubq7E8Gf+qUflsrzK1GV+Z26tu4cGTWyF2DhUCYlQ4Qve X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a25:d688:0:b0:e60:9d6d:4bc1 with SMTP id 3f1490d57ef6-e6690ec2fe5mr17270276.3.1742625306129; Fri, 21 Mar 2025 23:35:06 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:45 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-18-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: 
Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 17/35] perf vendor events: Update icelakex events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/icelakex/cache.json | 273 +++++++++++ .../arch/x86/icelakex/icx-metrics.json | 399 ++++++++------- .../pmu-events/arch/x86/icelakex/memory.json | 190 +++++++ .../pmu-events/arch/x86/icelakex/other.json | 463 ------------------ 4 files changed, 662 insertions(+), 663 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/icelakex/cache.json b/tools/per= f/pmu-events/arch/x86/icelakex/cache.json index e8ab6ef2cd50..e46fd6f91d6b 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/cache.json @@ -1,4 +1,67 @@ [ + { + "BriefDescription": "Hit snoop reply with data, line invalidated.", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.I_FWD_FE", + "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated: removed from this core's cache, after the dat= a is forwarded back to the requestor and indicating the data was found unmo= dified in the (FE) Forward or Exclusive State in this cores caches cache. 
= A single snoop response from the core counts on all hyperthreads of the cor= e.", + "SampleAfterValue": "1000003", + "UMask": "0x20" + }, + { + "BriefDescription": "HitM snoop reply with data, line invalidated.= ", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.I_FWD_M", + "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated: removed from this core's caches, after the da= ta is forwarded back to the requestor, and indicating the data was found mo= dified(M) in this cores caches cache (aka HitM response). A single snoop r= esponse from the core counts on all hyperthreads of the core.", + "SampleAfterValue": "1000003", + "UMask": "0x10" + }, + { + "BriefDescription": "Hit snoop reply without sending the data, lin= e invalidated.", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.I_HIT_FSE", + "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated in this core's caches without being forwarded = back to the requestor. The line was in Forward, Shared or Exclusive (FSE) s= tate in this cores caches. A single snoop response from the core counts on= all hyperthreads of the core.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Line not found snoop reply", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.MISS", + "PublicDescription": "Counts responses to snoops indicating that t= he data was not found (IHitI) in this core's caches. 
A single snoop respons= e from the core counts on all hyperthreads of the Core.", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Hit snoop reply with data, line kept in Share= d state.", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.S_FWD_FE", + "PublicDescription": "Counts responses to snoops indicating the li= ne may be kept on this core in the (S)hared state, after the data is forwar= ded back to the requestor, initially the data was found in the cache in the= (FS) Forward or Shared state. A single snoop response from the core count= s on all hyperthreads of the core.", + "SampleAfterValue": "1000003", + "UMask": "0x40" + }, + { + "BriefDescription": "HitM snoop reply with data, line kept in Shar= ed state", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.S_FWD_M", + "PublicDescription": "Counts responses to snoops indicating the li= ne may be kept on this core in the (S)hared state, after the data is forwar= ded back to the requestor, initially the data was found in the cache in the= (M)odified state. A single snoop response from the core counts on all hyp= erthreads of the core.", + "SampleAfterValue": "1000003", + "UMask": "0x8" + }, + { + "BriefDescription": "Hit snoop reply without sending the data, lin= e kept in Shared state.", + "Counter": "0,1,2,3", + "EventCode": "0xef", + "EventName": "CORE_SNOOP_RESPONSE.S_HIT_FSE", + "PublicDescription": "Counts responses to snoops indicating the li= ne was kept on this core in the (S)hared state, and that the data was found= unmodified but not forwarded back to the requestor, initially the data was= found in the cache in the (FSE) Forward, Shared state or Exclusive state. 
= A single snoop response from the core counts on all hyperthreads of the co= re.", + "SampleAfterValue": "1000003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the number of cache lines replaced in = L1 data cache.", "Counter": "0,1,2,3", @@ -506,6 +569,16 @@ "SampleAfterValue": "100003", "UMask": "0x80" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that hit in the L3 or were snooped from another co= re's caches on the same socket.", "Counter": "0,1,2,3", @@ -546,6 +619,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit in the L3 o= r were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -586,6 +669,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In SN= C Mode counts only those PMM accesses that are controlled by the close SNC = Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y PMM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x703C00001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y a cache on a remote socket where a snoop hit a modified line in another c= ore's caches which forwarded the data.", "Counter": "0,1,2,3", @@ -606,6 +709,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x703000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit a modified = line in a distant L3 Cache or were snooped from a distant core's L1/L2 cach= es on this socket when the system is in SNC (sub-NUMA cluster) mode.", "Counter": "0,1,2,3", @@ -626,6 +739,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y PMM on a distant memory controller of this socket when the system is in S= NC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SNC_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x700800001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches 
for exclusive ownership (PREFETCHW) that have a= ny type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit in= the L3 or were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -646,6 +779,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM attached to this socket, unless in Sub NUMA Cluster(SNC) Mod= e. In SNC Mode counts only those PMM accesses that are controlled by the c= lose SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.LOCAL_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x703C00002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.REMOTE_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x703000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches 
for exclusive ownership (PREFETCHW) that hit a = modified line in a distant L3 Cache or were snooped from a distant core's L= 1/L2 caches on this socket when the system is in SNC (sub-NUMA cluster) mod= e.", "Counter": "0,1,2,3", @@ -666,6 +829,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM on a distant memory controller of this socket when the syste= m is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SNC_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x700800002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that hit in the L3 or were snooped fro= m another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -676,6 +849,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch (which bring data to= L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10070", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetches to the L3 only tha= t have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L3.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x12380", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetches to the L3 only tha= t hit in the L3 or were snooped from another core's caches on the same sock= et.", "Counter": "0,1,2,3", @@ -686,6 +879,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetches to the L3 only tha= 
t were not supplied by the local socket's L1, L2, or L3 caches and the cach= eline was homed in a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L3.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x90002380", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts full cacheline writes (ItoM) that were= not supplied by the local socket's L1, L2, or L3 caches and the cacheline = was homed in a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.ITOM.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x90000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware and software prefetches to al= l cache levels that hit in the L3 or were snooped from another core's cache= s on the same socket.", "Counter": "0,1,2,3", @@ -696,6 +909,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that hit in the L3 or were snooped from another core's caches on the sa= me socket.", "Counter": "0,1,2,3", @@ -736,6 +959,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to this socket, unless in Sub NUMA C= luster(SNC) Mode. 
In SNC Mode counts only those PMM accesses that are cont= rolled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.LOCAL_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100400477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to this socket, whether or not in Su= b NUMA Cluster(SNC) Mode. In SNC Mode counts PMM accesses that are control= led by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x700C00477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F33000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop was sent= and data was returned (Modified or Not Modified).", "Counter": "0,1,2,3", @@ -766,6 +1019,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": 
"OCR.READS_TO_CORE.REMOTE_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x703000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that hit a modified line in a distant L3 Cache or were snooped from a d= istant core's L1/L2 caches on this socket when the system is in SNC (sub-NU= MA cluster) mode.", "Counter": "0,1,2,3", @@ -786,6 +1049,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM on a distant memory controller of this socket= when the system is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.SNC_PMM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x700800477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that hit in the L3 or= were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/too= ls/perf/pmu-events/arch/x86/icelakex/icx-metrics.json index 7bee03e532e4..a886a0cfee07 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json @@ -335,12 +335,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. 
False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -352,7 +352,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the 
Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -375,12 +375,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -395,7 +395,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= 
l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -403,7 +403,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + 
tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= 
oads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_stor= e_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tm= a_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing = + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -411,22 +411,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_br= anches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tm= a_itlb_misses + tma_lcp + tma_ms_switches) + 
tma_fetch_bandwidth * tma_ms /= (tma_dsb + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 *= tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts *= tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_= nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_E= VENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_di= vider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_= sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_as= sists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_clears_resteers + tma_= mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_= dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switc= hes) + 
tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_mite + tma_ms)) + 10 *= tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts *= tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_= nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_E= VENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_di= vider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_= sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_as= sists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -434,7 +434,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + 
tma_spli= t_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -442,7 +442,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + t= 
ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", @@ -450,7 +450,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -462,17 +462,17 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", "MetricThreshold": "tma_bottleneck_useful_work > 20" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -494,8 +494,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -503,8 +503,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -512,24 +512,24 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric 
estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -537,7 +537,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -545,33 +545,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": 
"tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((48 * tma_info_system_core_frequency - 4 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + 
OCR.DE= MAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_fr= equency - 4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSN= P_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tm= a_info_thread_clks", + "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD)= )) + 43.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. 
Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -581,25 +581,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(47.5 * tma_info_system_core_frequency - 4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", + "MetricExpr": "43.5 * tma_info_system_core_frequency * (MEM_LOAD_L= 3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT /= MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. 
Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -608,7 +608,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -618,7 +618,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -628,7 +628,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -636,34 +636,34 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "(120 * tma_info_system_core_frequency * cpu@OCR.DEM= AND_RFO.L3_MISS\\,offcore_rsp\\=3D0x103b800002@ + 48 * tma_info_system_core= _frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricExpr": "(120 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_MISS@offcore_rsp\\=3D0x103b800002@ + 48 * tma_info_system_core_frequ= ency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, @@ -683,7 +683,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -693,7 +693,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -711,7 +711,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -720,15 +720,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -736,7 +736,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU 
has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -745,7 +745,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -754,7 +754,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -763,7 +763,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -772,7 +772,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -784,17 +784,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -802,8 +802,8 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -815,28 +815,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", 
"MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -865,7 +865,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -874,7 +874,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -883,10 +883,11 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + 
tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -947,11 +948,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -964,20 +965,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1013,7 +1014,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1031,7 +1032,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": 
"Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1039,7 +1040,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1047,7 +1048,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1055,7 +1056,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1063,7 +1064,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1071,7 +1072,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1126,7 +1127,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 11", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1257,7 +1258,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1319,8 +1320,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1341,12 +1342,12 @@ "MetricExpr": "INST_RETIRED.ANY / 
ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1401,14 +1402,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1429,11 +1429,11 @@ "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", "MetricGroup": "LockCont;MemOffcore;Server;SoC", "MetricName": "tma_info_system_mem_irq_duplicate_address", - "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + "MetricThreshold": "tma_info_system_mem_irq_duplicate_address > 0.= 1" }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - 
"MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" @@ -1463,7 +1463,7 @@ "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1471,7 +1471,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." 
}, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1479,7 +1479,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1513,7 +1513,7 @@ "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1522,15 +1522,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage." }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1540,13 +1539,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1562,14 +1561,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 7.5" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1577,8 +1576,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & 
tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1586,7 +1585,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1595,7 +1594,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1605,7 +1604,7 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1614,7 +1613,7 @@ "MetricExpr": "4 * tma_info_system_core_frequency * MEM_LOAD_RETIR= ED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / = tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1624,17 +1623,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(23 * tma_info_system_core_frequency - 4 * tma_info= _system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "19 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1642,18 +1641,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1670,7 +1669,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1678,39 +1677,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "(66.5 * tma_info_system_core_frequency - 23 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "43.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1720,7 +1719,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1736,10 +1735,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1748,7 +1747,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1759,11 +1758,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1785,7 +1784,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1800,24 +1799,24 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. 
Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D1@ / tma_info_core_core= _clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1828,7 +1827,7 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1837,7 +1836,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1852,19 +1851,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1908,8 +1907,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= 
CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1917,8 +1916,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1926,7 +1925,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1935,7 +1934,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1944,32 +1943,32 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "((120 * tma_info_system_core_frequency - 23 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (120 * = tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(97 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_HITM + 97 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & 
tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "(131 * tma_info_system_core_frequency - 23 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "108 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1982,7 +1981,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1991,7 +1990,7 @@ "MetricExpr": "37 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -2001,7 +2000,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -2010,8 +2009,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -2019,7 +2018,7 @@ "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -2028,8 +2027,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. 
This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -2038,8 +2037,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -2047,8 +2046,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2065,7 +2064,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -2073,31 +2072,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2105,7 +2104,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2114,7 +2113,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. 
These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -2123,8 +2122,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/icelakex/memory.json b/tools/pe= rf/pmu-events/arch/x86/icelakex/memory.json index ec9577cce3ac..ca7f68f67463 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/memory.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/memory.json @@ -113,6 +113,16 @@ "SampleAfterValue": "50021", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the local socket's L1, L= 2, or L3 caches.", "Counter": "0,1,2,3", @@ -133,6 +143,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM on a distant memory con= troller of this socket when the system is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -153,6 +193,46 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were n= ot supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -173,6 +253,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. 
In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that were not supplied by the local so= cket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -193,6 +303,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that were supplied by DRAM attached to= this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts onl= y those DRAM accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetches to the L3 only tha= t missed the local socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -253,6 +373,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -283,6 +413,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x731800477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA 
cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.READS_TO_CORE.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that missed the local= socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -303,6 +483,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/icelakex/other.json b/tools/per= f/pmu-events/arch/x86/icelakex/other.json index 05b348d9c838..141cd30a30af 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/other.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/other.json @@ -26,339 +26,6 @@ "SampleAfterValue": "200003", "UMask": "0x20" }, - { - "BriefDescription": "Hit snoop reply with data, line invalidated.", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.I_FWD_FE", - "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated: removed from this core's cache, after the dat= a is forwarded back to the requestor and indicating the data was found unmo= dified in the (FE) Forward or Exclusive State in this cores caches cache. 
= A single snoop response from the core counts on all hyperthreads of the cor= e.", - "SampleAfterValue": "1000003", - "UMask": "0x20" - }, - { - "BriefDescription": "HitM snoop reply with data, line invalidated.= ", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.I_FWD_M", - "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated: removed from this core's caches, after the da= ta is forwarded back to the requestor, and indicating the data was found mo= dified(M) in this cores caches cache (aka HitM response). A single snoop r= esponse from the core counts on all hyperthreads of the core.", - "SampleAfterValue": "1000003", - "UMask": "0x10" - }, - { - "BriefDescription": "Hit snoop reply without sending the data, lin= e invalidated.", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.I_HIT_FSE", - "PublicDescription": "Counts responses to snoops indicating the li= ne will now be (I)nvalidated in this core's caches without being forwarded = back to the requestor. The line was in Forward, Shared or Exclusive (FSE) s= tate in this cores caches. A single snoop response from the core counts on= all hyperthreads of the core.", - "SampleAfterValue": "1000003", - "UMask": "0x2" - }, - { - "BriefDescription": "Line not found snoop reply", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.MISS", - "PublicDescription": "Counts responses to snoops indicating that t= he data was not found (IHitI) in this core's caches. 
A single snoop respons= e from the core counts on all hyperthreads of the Core.", - "SampleAfterValue": "1000003", - "UMask": "0x1" - }, - { - "BriefDescription": "Hit snoop reply with data, line kept in Share= d state.", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.S_FWD_FE", - "PublicDescription": "Counts responses to snoops indicating the li= ne may be kept on this core in the (S)hared state, after the data is forwar= ded back to the requestor, initially the data was found in the cache in the= (FS) Forward or Shared state. A single snoop response from the core count= s on all hyperthreads of the core.", - "SampleAfterValue": "1000003", - "UMask": "0x40" - }, - { - "BriefDescription": "HitM snoop reply with data, line kept in Shar= ed state", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.S_FWD_M", - "PublicDescription": "Counts responses to snoops indicating the li= ne may be kept on this core in the (S)hared state, after the data is forwar= ded back to the requestor, initially the data was found in the cache in the= (M)odified state. A single snoop response from the core counts on all hyp= erthreads of the core.", - "SampleAfterValue": "1000003", - "UMask": "0x8" - }, - { - "BriefDescription": "Hit snoop reply without sending the data, lin= e kept in Shared state.", - "Counter": "0,1,2,3", - "EventCode": "0xef", - "EventName": "CORE_SNOOP_RESPONSE.S_HIT_FSE", - "PublicDescription": "Counts responses to snoops indicating the li= ne was kept on this core in the (S)hared state, and that the data was found= unmodified but not forwarded back to the requestor, initially the data was= found in the cache in the (FSE) Forward, Shared state or Exclusive state. 
= A single snoop response from the core counts on all hyperthreads of the co= re.", - "SampleAfterValue": "1000003", - "UMask": "0x4" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM on a distant memory con= troller of this socket when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In SN= C Mode counts only those PMM accesses that are controlled by the close SNC = Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703C00001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM on a distant memory controller of this socket when the system is in S= NC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.SNC_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700800001", - "SampleAfterValue": 
"100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC0002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM attached to this socket, unless in Sub NUMA Cluster(SNC) Mod= e. 
In SNC Mode counts only those PMM accesses that are controlled by the c= lose SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.LOCAL_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703C00002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.REMOTE_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by PMM on a distant memory controller of this socket when the syste= m is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.SNC_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700800002", - "SampleAfterValue": 
"100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and so= ftware prefetches (except PREFETCHW) that were supplied by DRAM attached to= this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts onl= y those DRAM accesses that are controlled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch (which bring data to= L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10070", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L3.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x12380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t were not supplied by the local socket's L1, L2, or L3 caches and the cach= eline was homed in a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L3.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x90002380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts full cacheline writes (ItoM) that were= not supplied by 
the local socket's L1, L2, or L3 caches and the cacheline = was homed in a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.ITOM.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x90000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts miscellaneous requests, such as I/O an= d un-cacheable accesses that have any type of response.", "Counter": "0,1,2,3", @@ -369,126 +36,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC0477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to this socket, unless in Sub NUMA C= luster(SNC) Mode. In SNC Mode counts only those PMM accesses that are cont= rolled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.LOCAL_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100400477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x70C000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to this socket, whether or not in Su= b NUMA Cluster(SNC) Mode. 
In SNC Mode counts PMM accesses that are control= led by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700C00477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F33000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x731800477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.REMOTE_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": 
"0x703000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM on a distant memory controller of this socket= when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.READS_TO_CORE.SNC_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700800477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -498,15 +45,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.WRITE_ESTIMATE.MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFBFF80822", - "SampleAfterValue": "100003", - "UMask": "0x1" } ] --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 276B31E7C27 for ; Sat, 22 Mar 2025 
06:35:09 +0000 (UTC) Date: Fri, 21 Mar 2025 23:33:46 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-19-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 18/35] perf vendor events: Update ivybridge metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update TMA metrics from 4.8 to 5.02. Signed-off-by: Ian Rogers --- .../arch/x86/ivybridge/ivb-metrics.json | 76 +++++++++++++------ .../arch/x86/ivybridge/metricgroups.json | 5 ++ 2 files changed, 56 insertions(+), 25 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json b/to= ols/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json index 77d37db98b70..de651ff9f846 100644 --- a/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json +++ b/tools/perf/pmu-events/arch/x86/ivybridge/ivb-metrics.json @@ -151,7 +151,7 @@ "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1= + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD= _UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + M= EM_LOAD_UOPS_RETIRED.LLC_MISS))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP= _MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT = + MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + = MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSN= P_MISS + MEM_LOAD_UOPS_RETIRED.LLC_MISS)))) / tma_info_thread_clks", - "MetricGroup": 
"BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", @@ -184,7 +184,7 @@ "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -236,7 +236,7 @@ { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER= _CORE / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", @@ -246,7 +246,7 @@ "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. 
The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", @@ -305,7 +305,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -314,7 +314,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -334,7 +334,7 @@ "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -346,7 +346,7 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", @@ -397,6 +397,12 @@ "MetricGroup": "Fed", "MetricName": "tma_info_frontend_ipunknown_branch" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", @@ -474,7 +480,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -486,7 +492,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -504,7 +510,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * 
LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -523,7 +529,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -555,7 +561,7 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", + "BriefDescription": "", "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" @@ -568,7 +574,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -586,14 +592,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -618,6 +624,19 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 15.6 / (tma_info_system_time = * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -630,6 +649,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": 
"tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -691,12 +717,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. 
Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -761,7 +787,7 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. 
Related metrics: tma_store_latency", @@ -781,7 +807,7 @@ { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", @@ -840,7 +866,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. 
Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -849,7 +875,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -949,7 +975,7 @@ "MetricExpr": "13 * LD_BLOCKS.NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_split_loads > 0.3", "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, @@ -965,7 +991,7 @@ { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", @@ -993,7 +1019,7 @@ "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. 
Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", diff --git a/tools/perf/pmu-events/arch/x86/ivybridge/metricgroups.json b/t= ools/perf/pmu-events/arch/x86/ivybridge/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/ivybridge/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/ivybridge/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": 
"Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5E0D41E8824 for ; Sat, 22 Mar 2025 06:35:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1742625317; cv=none; b=IiWdbYSziGWHlJa5qZVUwwhdYiSclyEKRGKnKjNLu1L+zLwGj7MCTUUTJuH26166n1UmezBe27UspQ39WUth23WEZKZKntP4bm3r8i9t/zfgpRd9s/iSd8NNefag5WgJIgaTr7rVezbd6dDreEXkScVqCpMV5whEsHX4MAt/4pg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625317; c=relaxed/simple; bh=CnuQSxx5Yu1wyHO5RngC24AJ1X96YzXrl2hjW0dNQbI=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=O5ykZMUbBeB0QgrTtMNEOvG0MWPHlPSuK2DIpcLsExfmTlTn+IDXIeYPbOEvO9qeYo2uUIGARibcPAZw95yp6z8BE4f5RLhw90INWK9Sz3HXSJ+mReOkI3wwr+Xjh8VTKw3SLktLbhmOx5VVWx7dQL2uhyYsFyRs2gGEjABL4EA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=vDK+UP40; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="vDK+UP40" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6fee49e26bcso32309757b3.3 for ; Fri, 21 Mar 2025 23:35:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625311; x=1743230111; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=s+cdTdHDpZF3mMROwANPX8DYhSkxmnBHZWoA/0TRdpI=; b=vDK+UP40+q4ubBmdijGHA5d4zzRLywtPnt/BGNOu7TpIrASUPNyxG7xBheNyfmSgyh RFGumdctgXS1PT/2jgIc1CUCB15VcjKwL/cj5UPp64LXDDJOfXM9osEdzivqljkg7SWl s8UI2xhgoeMtcUvxyMdK+aRNVLabkA/4uWEozvr+Ka3g1zQMIpKxEknIt8oj+IUKty2T 
BdWLBhUA9VIe8/qyx+UG3cqmt3dUcl/wmZcRSdg/W/aTmswTqyGRM0eKWdkk1kY7fJni PRt14C2sUFnFIUsAf+bUoq61ZIpLMM5Di7FvmLf/tqaELuqsnjtul4aVhmOJkccojqaa Hp+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625311; x=1743230111; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=s+cdTdHDpZF3mMROwANPX8DYhSkxmnBHZWoA/0TRdpI=; b=bzt7LLLvFZBpV+Q1siwJs3+0QeQunqgdcliKkwjf9+ScXbSggK9vxLYN9SGH0Ay75g E1UAPtGhWkBetMtgqmLPuepBbUWFDtOfl80OMEdxjd0gJCYh0mmtoPZPBUBVpOtTgMfg R7TQyxbZP9AjnJgIzZgi/JoanFlYbz+A9ttJMHG+CcPy19BnlBb3LFrWHKaxDQ3+Ni+F dGI7NIp/ga+Hv57LawbDr2m2p0k/YGD6BSabF+SDM8/mN2FLamcemBAgnu9c+RkvNnaI SH8hjYvNLolC2H5vpwZY2QmJhPy7fLZSgdqoYgycYzZyPUjFdL1gRwnFijsgiZG6zYrC Kx8g== X-Forwarded-Encrypted: i=1; AJvYcCX0yVuJ3JAhbNkcr9oodxpKrcvKrgECV7DmIP5mh/SVId64YaCr1arjWA+mRIYBEK3W58WUAq+d34dOjKc=@vger.kernel.org X-Gm-Message-State: AOJu0YzJSRW3+NTHgjz0wY5uz5nDTykJRGj5yjaEDjFFOyFGNsPA7+C/ NE7oixcsOCwmDqdT65eH58F1hIpPxhdRGUHWO+L8heWCrummzS85Rf6WOXbUpcit2zXGVowoe9B aM6/HmQ== X-Google-Smtp-Source: AGHT+IFQm7maxnYzH20p96ymeTVTLJ2XVAbqJoMR25n85GpbiTtWNZ6q9NNqkhfcoLk4gn920dkeXLkUAaJL X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:690c:6801:b0:6fd:29b5:fcb2 with SMTP id 00721157ae682-700bad0acfemr36057b3.5.1742625311110; Fri, 21 Mar 2025 23:35:11 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:47 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-20-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 19/35] perf vendor events: Update ivytown metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo 
Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update TMA metrics from 4.8 to 5.02. Signed-off-by: Ian Rogers --- .../arch/x86/ivytown/ivt-metrics.json | 80 ++++++++++++------- .../arch/x86/ivytown/metricgroups.json | 5 ++ 2 files changed, 58 insertions(+), 27 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json b/tool= s/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json index 8fe0512c938f..714d5e6d21e7 100644 --- a/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json +++ b/tools/perf/pmu-events/arch/x86/ivytown/ivt-metrics.json @@ -151,7 +151,7 @@ "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM * (1= + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD= _UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + M= EM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.R= EMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_LLC= _MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS= * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM= _LOAD_UOPS_RETIRED.LLC_HIT + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT + MEM_L= OAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MIS= S + MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETI= 
RED.REMOTE_DRAM + MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOP= S_LLC_MISS_RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", @@ -184,7 +184,7 @@ "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -236,7 +236,7 @@ { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_= HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info= _thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", @@ -246,7 +246,7 @@ "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", @@ -305,7 +305,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -314,7 +314,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -334,7 +334,7 @@ "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. 
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -346,7 +346,7 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", @@ -397,6 +397,12 @@ "MetricGroup": "Fed", "MetricName": "tma_info_frontend_ipunknown_branch" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", @@ -474,7 +480,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -486,7 +492,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * 
L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -504,7 +510,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -523,7 +529,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -555,7 +561,7 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", + "BriefDescription": "", "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" @@ -568,7 +574,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -586,14 +592,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": 
"HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -627,11 +633,24 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ + power@energy\\-ram@) * 15.6 = / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -644,6 +663,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -652,7 +678,7 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, @@ -711,12 +737,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", 
"MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS. 
Related me= trics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -790,7 +816,7 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. 
Related metrics: tma_store_latency", @@ -810,7 +836,7 @@ { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", @@ -869,7 +895,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. 
Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -878,7 +904,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -997,7 +1023,7 @@ "MetricExpr": "13 * LD_BLOCKS.NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_split_loads > 0.3", "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, @@ -1013,7 +1039,7 @@ { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", @@ -1041,7 +1067,7 @@ "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. 
Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", diff --git a/tools/perf/pmu-events/arch/x86/ivytown/metricgroups.json b/too= ls/perf/pmu-events/arch/x86/ivytown/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/ivytown/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/ivytown/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from 
Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7D58B1E9B28 for ; Sat, 22 Mar 2025 06:35:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1742625319; cv=none; b=r19paEOzT401jbTIRr6QZM2gWke0k7AFn86fO9LriN3BKtVP2xnRyb6NQIgD19XG3NJgvOxt01+hyhfgYWlcXA5ujiwbsrflzoUaCKbMGQZH6lPc1tU7YFGZMv5l6mrU6219fq7cXGkWuQAxDR9p26HC/QKHZ34UeQpFOjP3OQg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625319; c=relaxed/simple; bh=bWiq+5MyDXTElYieLM6C7dh8a0sGhdsKcE3TX6IvQ64=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=XdYoCT2gLxQsHxXEXVsYiCEdhO30WQpRONwWJYFtHYS7fWFxoDuJ4mioHYiJkD9fBi0f3o+myCSsaYBK3L5HIC1L281Uke4dn27pT+wuUrMoVXbtPaWlJUcb4i5/wQbLwN7CD0cy7/B8Qybf9VKlbk3XZos6vNIDwp5+vbR6aSU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=ocQhaSS6; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="ocQhaSS6" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e643f235aa3so3833054276.2 for ; Fri, 21 Mar 2025 23:35:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625313; x=1743230113; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=CC4DlL0ippkQfeYw3fEnFY+uPfIQ3TXIJOTrdzzBovo=; b=ocQhaSS6/3vxjscdMt1h0Rpl47MZBQusqChj/TVqvCiMGXYYG6uN5bdJf9mAitEBfI gLY5iNmeLPDVcM/9QsMJFEUXR98XpccSRrJn1ujqLqx1hUtckWv4QOUFY3j9mnNkvV+4 doBUvnOdEuonWhYE1bmpOI42XK57Z3HdnddTFfOKyS1f8t6ldEQaMYkTB8n0+FSrkjm3 
ex+SJ0gxpGoZXW0lkvD8FfVRZwbgRWJczRh01KRLJuVKuO5RpLRRojSZPxZUYfdzSCD/ +4kesaFPB6M8AaDSruqOnLER0t5O0HesGFaLPjyjl384lpIU1KRMnQ3Ov1KYlbthFI+Y 7kiw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625313; x=1743230113; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=CC4DlL0ippkQfeYw3fEnFY+uPfIQ3TXIJOTrdzzBovo=; b=idgtnHEbH4eBzyMhV8EcA01LN8sRwz6uRa0aNSqauESkMws6S83glHn2RprGEwOoDG qNX1C5sg5TnjZtG/0VMRhkhNBrHUGB8zpofdsnWUcn8OXSM5tKPgvFCYUkO+3OokHXbs cILxwJbhNuxOX1DY+Dq5iNXBBt5Z15JCqwPy4bPh+8ItG7J1jd8RDtBy8wSDlSQHZBj9 AURXwnFqjiZWOo1E3xHYajFnnAjL+PbpnyHZrg4VpxshJyxPeYCDPDxHydtwD1qt+4gc qZMp4JKVXpuXEgEqLhbgPIkI98VO1UmhWwalsbQrkvfJ42yHFrVXS4gW1nws3MM2CfLP atZA== X-Forwarded-Encrypted: i=1; AJvYcCWXBHHzXYkChpZlt1mYm1X6US/9SKwUNx5PLoktLbcJFTbjwTOltIRKyH1qGSwikAlBG/jnaUe6p4qFN1I=@vger.kernel.org X-Gm-Message-State: AOJu0YwN8RvfE0NV1uAufyIkwX1Y5FdvimnWbBomUpUWhpV+GAe9fSWX 9vcQQLeL63ri2CNUm6SX7gQjv7nCvBSE3h7jKSTyVDXnZUR3Ev0QEoSnC9mQeI9Nvquh/2PvbMg d0TEalw== X-Google-Smtp-Source: AGHT+IHS0vmRnGzf9rfweWGsLnCD/+4MNQyLgqX5X+F+Kl+bZEC7ClD0Hx4zu5gc1gswKqk4U8NxmQCxlE8C X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a25:8547:0:b0:e60:89be:c33a with SMTP id 3f1490d57ef6-e66a4a89d22mr3949276.0.1742625313162; Fri, 21 Mar 2025 23:35:13 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:48 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-21-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 20/35] perf vendor events: Update jaketown metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo 
Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update TMA metrics from 4.8 to 5.02. Move INSTS_WRITTEN_TO_IQ.INSTS to the frontend topic. Signed-off-by: Ian Rogers --- .../arch/x86/jaketown/frontend.json | 8 ++++ .../arch/x86/jaketown/jkt-metrics.json | 40 ++++++++++++++----- .../arch/x86/jaketown/metricgroups.json | 5 +++ .../pmu-events/arch/x86/jaketown/other.json | 8 ---- 4 files changed, 43 insertions(+), 18 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/jaketown/frontend.json b/tools/= perf/pmu-events/arch/x86/jaketown/frontend.json index 3cb468da7011..97e7760aeb26 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/frontend.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/frontend.json @@ -278,5 +278,13 @@ "EventName": "IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE", "SampleAfterValue": "2000003", "UMask": "0x1" + }, + { + "BriefDescription": "Valid instructions written to IQ per cycle.", + "Counter": "0,1,2,3", + "EventCode": "0x17", + "EventName": "INSTS_WRITTEN_TO_IQ.INSTS", + "SampleAfterValue": "2000003", + "UMask": "0x1" } ] diff --git a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json b/too= ls/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json index f8c18741b360..6f636ea0f216 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/jkt-metrics.json @@ -127,7 +127,7 @@ "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric 
represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { @@ -211,7 +211,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -220,7 +220,7 @@ "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -240,7 +240,7 @@ "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -275,6 +275,12 @@ "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 4 > 0.35", "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_lcp" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Total number of retired Instructions", "MetricExpr": "INST_RETIRED.ANY", @@ -290,7 +296,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -308,14 +314,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_mem_bandwidth" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EX= E.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_= OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PA= CKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -349,11 +355,18 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -366,6 +379,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -374,7 +394,7 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, @@ -468,7 +488,7 @@ { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound 
> 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_info_system_dram_bw_use", diff --git a/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/jaketown/metricgroups.json index 7dc7eb0d3dd3..eb8fbd14138a 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -33,6 +34,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping 
from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -48,6 +50,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -75,6 +78,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -99,6 +103,7 @@ "tma_issueSmSt": "Metrics related by the issue $issueSmSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", diff --git 
a/tools/perf/pmu-events/arch/x86/jaketown/other.json b/tools/per= f/pmu-events/arch/x86/jaketown/other.json index 42692fa24b6c..970839a9c786 100644 --- a/tools/perf/pmu-events/arch/x86/jaketown/other.json +++ b/tools/perf/pmu-events/arch/x86/jaketown/other.json @@ -33,14 +33,6 @@ "SampleAfterValue": "2000003", "UMask": "0x2" }, - { - "BriefDescription": "Valid instructions written to IQ per cycle.", - "Counter": "0,1,2,3", - "EventCode": "0x17", - "EventName": "INSTS_WRITTEN_TO_IQ.INSTS", - "SampleAfterValue": "2000003", - "UMask": "0x1" - }, { "BriefDescription": "Cycles when L1 and L2 are locked due to UC or= split lock.", "Counter": "0,1,2,3", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C50F81EB5D7 for ; Sat, 22 Mar 2025 06:35:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625337; cv=none; b=lUUOnAokq0EVsASqL+OHWwXXJOTUYDYmhZTccZUDkc+UtQzhoCPB3wNnI4XTjVYRVO+FlZq4+Bb9/Gd2mUQfSIjTtcHJWHDropM/fELcdQN/tCey+FGWSul1yHhu+mo//U8do6sIkCrMC/LmyL0FWdlfE9b7d+g8bYqj9kmi/3g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625337; c=relaxed/simple; bh=/Czwh93Cr1HYyvvd0zF2rmxOC+/u0jE7zz1lBJSkchQ=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=Z8Pd6asJwh/q2zCeFuntMdBylDnjRLGq302xn+pl+XlFlyPeM0dyvMKMOH8nW27N/r61uPlWklsfVg3E0/0J8+jDxyrv7Y/W1NDs1l3IOTyktPonWaaV/d1qBykcEYu2gvA/lRfcU97wuy7uKXnAzTT+bnoSx41AqGTTxwSrR4Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) 
header.d=google.com header.i=@google.com header.b=S6dNK6pP; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="S6dNK6pP" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e63f6cdaf27so3644736276.0 for ; Fri, 21 Mar 2025 23:35:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625316; x=1743230116; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=syocn0SO+jBBIkE/o0xIYKpUTldsfeCpYzBj4W1zZEk=; b=S6dNK6pPpRLoNUXWqaHaaUoJ11dQjAYVfUfR0uHHtDOU5LBQwoHPVA5D6nkRkfHZox v5/9VXpJLDYkOlAAVNI6JMPDB5xjlK61zdrfPyYa0rXzThLko90U/omWFv5G9jjjYdB+ wsPFm9lXsc3o5FGxYccaBrK9lV3QiB7PIYh8zDeW2XTIgzBSbXnLBMPb8IFguIhD7vYY SZbR/+65AFPOMqyQ38VOFdQyV5Kx/Je+kX+klK9BjXcbMGESr5xtd+VKanolRCirGhui a0W9fWHK+MbbDN5EpabH4fQNRfl7W5EZnum5sWh8kyO6c59dt6OSGGCY9Pmx+Ik7r6L3 JMdA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625316; x=1743230116; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=syocn0SO+jBBIkE/o0xIYKpUTldsfeCpYzBj4W1zZEk=; b=rm9fB1KPtITuxLxC6BiQK5Qd7kkkdfbWimlSrvtWyrKp7WD0foZCcnnrCnV1vuzoom Bb6WOM1ILA44YLT8d65CEC1ZSK75wE0f8ct9roF7cBeQCfnuBJ1wxh5OdBOpyuyH5Xzi DF+KjeGmTSFYfKM6go8LCU0I7oJhWqtDePVT2CtEfJjMzC33w38f/HHLxZACqaKRjsl0 TcP8/yIQQlKQFkVc7rMeYTJ/PmspLSTIJDDWTMA4XUuHPw7JTzSpdmOq6GnrcXjTA7aY 3m89xFRci0o7WkJrluAbJsuVmn8HhJR6tsp9Gj917bI9QkgOGNV/QKOHbUGN9yhEbTf2 Vjdg== X-Forwarded-Encrypted: i=1; 
AJvYcCVZvqxfjH8mFcoKbSJQr2A59W9aJyZQkPgpQ4e2j5oLDfS8EO7uax+Ub107sHoV+hxo+55uXysLN8NphjE=@vger.kernel.org X-Gm-Message-State: AOJu0YxuD0Ph2KbzRK9TAdp7tNjAUX9aIUFH+hkx5pR+J0mD8SXNPCiy TmB1j0cMn3BmMXLEJnoBaJssLZXBy4iyvidBpGAYvmQIDNIYvSGvVpRWSvgDxCd7uWht+nI4RFk GGSmj3Q== X-Google-Smtp-Source: AGHT+IEiz2/teqRrhYr7Evp0R6gApSMV7rSqVIQNN17w98QH7mUUcv5RItdfUlqFjPyxpf99Ba5jBvXnMNP6 X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a25:ef44:0:b0:e64:e234:ef50 with SMTP id 3f1490d57ef6-e66a4a89f91mr2937276.0.1742625315674; Fri, 21 Mar 2025 23:35:15 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:49 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-22-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 21/35] perf vendor events: Update lunarlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. 
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/lunarlake/cache.json | 122 ++++ .../arch/x86/lunarlake/lnl-metrics.json | 556 +++++++++--------- .../pmu-events/arch/x86/lunarlake/memory.json | 44 ++ .../pmu-events/arch/x86/lunarlake/other.json | 353 ----------- .../arch/x86/lunarlake/pipeline.json | 187 ++++++ 5 files changed, 630 insertions(+), 632 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json b/tools/pe= rf/pmu-events/arch/x86/lunarlake/cache.json index 15fb9921f4fc..4f783e7eb947 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/cache.json @@ -417,6 +417,51 @@ "UMask": "0x22", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of LLC prefetches that were= throttled due to Dynamic Prefetch Throttling. The throttle requestor/sou= rce could be from the uncore/SOC or the Dead Block Predictor. Counts on a p= er core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DPT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled= due to Demand Throttle Prefetcher. DTP Global Triggered with no Local Ove= rride. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DTP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches not throt= tled by DTP due to local override. These prefetches may still be throttled= due to another throttler mechanism. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DTP_OVERRIDE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled= due to LLC hit rate in . 
Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.HIT_RATE", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled due to exceeding the XQ threshold set by either XQ_THRESOLD_DTP or LLC_XQ_THRESHOLD. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.XQ_THRESH", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, { "BriefDescription": "Cycles when L1D is locked", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1183,6 +1228,39 @@ "UMask": "0xf", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts writebacks of modified cachelines that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10008", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts writebacks of non-modified cachelines that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_NONM.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by mem side cache.", "Counter": "0,1,2,3,4,5,6,7", @@ -1194,6 +1272,28 @@ "UMask": "0x1", "Unit":
"cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", "Counter": "0,1,2,3", @@ -1227,6 +1327,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", 
"Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json b/to= ols/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json index e748f839c4bd..f6c4ffad66b6 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json @@ -89,7 +89,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -99,7 +99,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound >0.10)", + "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%", @@ -111,7 +111,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL_P@ / (8 * cpu_= atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation >0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. 
Only issue slots wasted due to fast nukes such as memory ordering nukes are counted. Other nukes are not accounted for. Counts all issue slots blocked during this recovery window including relevant microcode flows and while uops are not yet available in the instruction queue (IQ). Also includes the issue slots that were consumed by the backend but were thrown away because they were younger than the mispredict or machine clear.", "ScaleUnit": "100%", @@ -122,7 +122,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "PublicDescription": "Counts the number of issue slots that were not delivered by the frontend due to BACLEARS, which occurs when the Branch Target Buffer (BTB) prediction or lack thereof, was corrected by a later branch predictor in the frontend.
Includes BACLEARS due to all branch types including conditional and unconditional jumps, returns, and indirect branches.", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -132,7 +132,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_speculation >0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -142,7 +142,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_latency >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -151,7 +151,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -160,7 +160,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >0.10))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL2",
"ScaleUnit": "100%", "Unit": "cpu_atom" @@ -170,7 +170,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -179,7 +179,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -189,7 +189,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (8 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound >0.20)", + "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -199,7 +199,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -208,7 +208,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -218,7 +218,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -578,7 +578,7 @@ "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@ / cpu_atom@CPU_CL= K_UNHALTED.CORE@", "MetricName": "tma_info_system_mux", - "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", "Unit": "cpu_atom" }, { @@ -617,7 +617,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / (8 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -626,7 +626,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / = (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", + 
"MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -636,7 +636,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -645,7 +645,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -654,7 +654,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -663,7 +663,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & 
tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -672,7 +672,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -681,7 +681,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -690,7 +690,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -699,7 +699,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bound >0.10))", + "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -710,7 +710,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
"MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "(tma_retiring >0.75)", + "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -720,12 +720,12 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "cpu_core@UOPS_DISPATCHED.ALU@ / (6 * tma_info_thread_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization", @@ -738,13 +738,13 @@ "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline.
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_threa= d_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -755,7 +755,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -767,18 +767,18 @@ { "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + 
topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-bad\\-spec@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20", @@ -795,16 +795,16 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + 
tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_= loads + tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + t= ma_fb_full)) + tma_memory_bound * 
(tma_l1_bound / (tma_l1_bound + tma_l2_bo= und + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + = tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)= ) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_cap= acity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bou= nd * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_= fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_la= tency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bou= nd / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + t= ma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_boun= d * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tm= a_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + 
tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity= + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency = + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bou= nd * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_fb_ful= l + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency = + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bou= nd / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_= split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_boun= d * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_f= alse_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)= ))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -813,16 +813,16 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + 
tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_= clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_br= anch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_u= nknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_reste= ers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth *= tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_co= de", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * t= ma_ms / (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code= ", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": 
"tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", @@ -830,7 +830,7 @@ }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_s= witches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_rest= eers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_r= esteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses= + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_= dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ls= d + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / = (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_mic= rocode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite = + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_bra= nch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_n= ukes / tma_other_nukes + tma_core_bound * 
(tma_serializing_operation + cpu_= core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / (t= ma_divider + tma_ports_utilization + tma_serializing_operation) + tma_micro= code_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (t= ma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -839,7 +839,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb= _full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_st= ores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fw= d_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bo= und + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency += tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -848,16 +848,16 @@ }, { "BriefDescription": "Total pipeline cost of Memory 
Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -870,11 +870,11 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", "Unit": "cpu_core" }, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_= sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -883,7 +883,7 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@top= down\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-re= tiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -897,26 +897,26 @@ "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_= thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric 
represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -925,8 +925,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": 
"tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -935,99 +935,99 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@frontend_retired.l1i_miss@R / tma_info_thread_clks - tma_code_l2_miss)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@front= end_retired.l2_miss@R / tma_info_thread_clks", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@FRONT= END_RETIRED.L2_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & 
(tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@frontend_retired.itlb_miss@R / tma_info_thread_clks - tma_code_stlb_mis= s)", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_mis= s)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@fro= ntend_retired.stlb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@FRO= NTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging 
structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@ / (cpu_core@ITLB_MISSES.= WALK_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_4K@ / (cpu_core@ITLB_MISSES.WAL= K_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound 
> 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@br_misp_retired.cond_ntaken_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_nt_mispredicts", - "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@ * cpu= _core@br_misp_retired.cond_taken_bwd_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@ * cpu= _core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_bwd_mispredicts", - "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & (tma_bran= ch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", 
"ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@ * cpu= _core@br_misp_retired.cond_taken_fwd_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= .", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@ * cpu= _core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_fwd_mispredicts", - "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & (tma_bran= ch_mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ *= cpu_core@mem_load_l3_hit_retired.xsnp_miss@R, cpu_core@MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS@ * (27 * tma_info_system_core_frequency) - 3 * tma_info_syst= em_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_miss@R else= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * (27 * tma_info_system_core_f= requency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM@ * cpu_core@mem_load_l3_hit_retired.xsnp_hitm@R, cpu= _core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * (28 * tma_info_system_core_frequ= ency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit= _retired.xsnp_hitm@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * (28= * tma_info_system_core_frequency) - 3 * 
tma_info_system_core_frequency)) *= (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS= @ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_fre= quency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@ * min(cpu_core@MEM_LO= AD_L3_HIT_RETIRED.XSNP_HITM@R, 25 * tma_info_system_core_frequency)) * (1 += cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2= ) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. 
Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1038,17 +1038,17 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@= * cpu_core@mem_load_l3_hit_retired.xsnp_no_fwd@R, cpu_core@MEM_LOAD_L3_HIT= _RETIRED.XSNP_NO_FWD@ * (27 * tma_info_system_core_frequency) - 3 * tma_inf= o_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_no_fw= d@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * (27 * tma_info_sys= tem_core_frequency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@M= EM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * cpu_core@mem_load_l3_hit_retired.xsnp_fw= d@R, cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (28 * tma_info_system_cor= e_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load= _l3_hit_retired.xsnp_fwd@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ = * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequenc= y)) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1= _MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * mi= n(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core= _frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency)) * (1= + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ /= 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to 
data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1057,7 +1057,7 @@ "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1067,7 +1067,7 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.MEM@ / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1078,7 +1078,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1087,28 +1087,28 @@ "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * cp= u_core@mem_inst_retired.stlb_hit_loads@R, cpu_core@MEM_INST_RETIRED.STLB_HI= T_LOADS@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_loads@R else cpu_co= re@MEM_INST_RETIRED.STLB_HIT_LOADS@ * 7) / tma_info_thread_clks + tma_load_= stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * min(cpu= _core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, 7) / tma_info_thread_clks + tma_lo= ad_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * c= pu_core@mem_inst_retired.stlb_hit_stores@R, cpu_core@MEM_INST_RETIRED.STLB_= HIT_STORES@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_stores@R else cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * 7) / tma_info_thread_clks + tma_= store_stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * min(cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, 7) / tma_info_thread_clks + tma_= store_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1117,8 +1117,8 @@ "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1128,7 +1128,7 @@ "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1145,12 +1145,12 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1160,7 +1160,7 @@ "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1170,7 +1170,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1180,16 +1180,16 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "cpu_core@ARITH.FPDIV_ACTIVE@ / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1198,8 +1198,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1208,8 +1208,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1218,8 +1218,8 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1228,41 +1228,41 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). 
Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1271,26 +1271,26 @@ "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@br_misp_retired.indirect_call_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= .", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_call_mispredicts", - "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & (tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles 
the= CPU was stalled due to retired misprediction by indirect JMP instructions", - "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@br_misp_retired.indirect_cost@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@br_misp_retired.indirect_call_cost@R) / tma_info_thread= _clks, 0)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions.= ", + "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_COST@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread= _clks, 0)", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_jump_mispredicts", - "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & (tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1303,7 +1303,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", @@ -1311,29 +1311,29 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= .", "MetricExpr": 
"cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN_BWD@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate).= ", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN_FWD@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -1357,7 +1357,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite 
+ tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -1366,7 +1366,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1375,10 +1375,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: ", "Unit": "cpu_core" }, { @@ -1444,12 +1445,12 @@ "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.V0@ + cpu_core@FP_ARI= TH_DISPATCHED.V1@ + cpu_core@FP_ARITH_DISPATCHED.V2@ + cpu_core@FP_ARITH_DI= SPATCHED.V3@) / (4 * tma_info_thread_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" @@ -1464,15 +1465,15 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= 
re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= frontend_retired.any_dsb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= FRONTEND_RETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", "MetricGroup": "DSBmiss;Fed;FetchLat", "MetricName": "tma_info_frontend_dsb_switches_ret", "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", @@ -1480,7 +1481,7 @@ }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" @@ -1530,7 +1531,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@fron= tend_retired.ms_flows@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@FRON= TEND_RETIRED.MS_FLOWS@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat;MicroSeq", "MetricName": "tma_info_frontend_ms_latency_ret", "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", @@ -1545,21 +1546,21 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": 
"cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@frontend_retired.unknown_branch@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@FRONTEND_RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat", "MetricName": "tma_info_frontend_unknown_branches_ret", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", @@ -1579,7 +1580,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW.", "Unit": "cpu_core" }, { @@ -1588,7 +1589,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1597,7 +1598,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1606,7 +1607,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1615,7 +1616,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1678,7 +1679,7 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 17", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1803,7 +1804,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" @@ -1861,7 +1862,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@mem_inst_retired.stlb_miss_loads@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@MEM_INST_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", @@ -1884,7 +1885,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@mem_inst_retired.stlb_miss_stores@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@MEM_INST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", @@ -1923,20 +1924,20 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", "MetricGroup": 
"MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1981,23 +1982,22 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.FAR_BR= ANCH:u", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per 
Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / cpu_core@CPU_CLK_UNHA= LTED.THREAD@", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -2034,7 +2034,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", @@ -2045,7 +2045,6 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { @@ -2053,7 +2052,7 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", "Unit": "cpu_core" }, { @@ -2065,7 +2064,7 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" @@ -2083,15 +2082,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "MetricThreshold": "tma_info_thread_uptb < 12", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2101,7 +2100,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). 
Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2110,8 +2109,8 @@ "MetricExpr": "cpu_core@INT_VEC_RETIRED.128BIT@ / (tma_retiring * = tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2120,8 +2119,8 @@ "MetricExpr": "cpu_core@INT_VEC_RETIRED.256BIT@ / (tma_retiring * = tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_= 2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2130,8 +2129,8 @@ "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2140,17 +2139,17 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L1@ / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * cpu_cor= e@mem_load_retired.l1_hit_l1@R, cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * 9) i= f 0 < cpu_core@mem_load_retired.l1_hit_l1@R else cpu_core@MEM_LOAD_RETIRED.= L1_HIT_L1@ * 9) / tma_info_thread_clks", + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache.", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@ * min(cpu_core= @MEM_LOAD_RETIRED.L1_HIT_L1@R, 9) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_capacity", - "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & (tma_l1_bound = > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2159,8 +2158,8 @@ "MetricExpr": "4 * cpu_core@DEPENDENT_LOADS.ANY@ / tma_info_thread= _clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: DEPENDENT_LOADS.ANY", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2169,17 +2168,17 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L2@ / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * cpu_core@m= em_load_retired.l2_hit@R, cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info= _system_core_frequency)) if 0 < cpu_core@mem_load_retired.l2_hit@R else cpu= _core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info_system_core_frequency)) * (1= + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ /= 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L2_HIT@R, 3 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2189,18 +2188,18 @@ "MetricExpr": "cpu_core@MEMORY_STALLS.L3@ / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * cpu_core@m= em_load_retired.l3_hit@R, cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 * tma_inf= o_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_c= ore@mem_load_retired.l3_hit@R else cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 = * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (= 1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ = / 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L3_HIT@R, 9 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. 
Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2209,19 +2208,19 @@ "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2231,7 +2230,7 @@
  "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
  "MetricName": "tma_load_op_utilization",
  "MetricThreshold": "tma_load_op_utilization > 0.6",
- "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.LOAD",
+ "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2240,7 +2239,7 @@
  "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)",
  "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
  "MetricName": "tma_load_stlb_hit",
- "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2249,43 +2248,43 @@
  "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_thread_clks",
  "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
  "MetricName": "tma_load_stlb_miss",
- "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+ "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses.",
  "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_1G@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_1G@)",
  "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
  "MetricName": "tma_load_stlb_miss_1g",
- "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+ "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses.",
  "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_1G@)",
  "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
  "MetricName": "tma_load_stlb_miss_2m",
- "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+ "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses.",
  "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_1G@)",
  "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
  "MetricName": "tma_load_stlb_miss_4k",
- "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
  "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
- "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@mem_inst_retired.lock_loads@R / tma_info_thread_clks",
+ "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@MEM_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks",
  "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
  "MetricName": "tma_lock_latency",
- "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_lock_latency > 0.2 &
(tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
  "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
@@ -2296,7 +2295,7 @@
  "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
  "MetricName": "tma_lsd",
  "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
- "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
+ "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure.",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2307,17 +2306,17 @@
  "MetricName": "tma_machine_clears",
  "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
  "MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.
Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+ "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
  "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
- "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+ "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
  "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
  "MetricName": "tma_mem_bandwidth",
- "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
- "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores.
This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_sq_full",
+ "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+ "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).
Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2326,34 +2325,34 @@
  "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tma_mem_bandwidth",
  "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
  "MetricName": "tma_mem_latency",
- "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
  "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
  "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
- "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+ "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
  "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
  "MetricName": "tma_memory_bound",
  "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
  "MetricgroupNoGroup": "TopdownL2",
- "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.
Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+ "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+ "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions.",
  "MetricConstraint": "NO_GROUP_EVENTS_NMI",
  "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thread_clks",
  "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
  "MetricName": "tma_memory_fence",
- "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
+ "BriefDescription": "This metric represents fraction of slots where the CPU
was retiring memory operations -- uops for memory load or store accesses.",
  "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY@ / (tma_retiring * tma_info_thread_slots)",
  "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
  "MetricName": "tma_memory_operations",
@@ -2376,14 +2375,14 @@
  "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks",
  "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
  "MetricName": "tma_mispredicts_resteers",
- "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+ "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
  "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES.
Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
  "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
- "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks + cpu_core@IDQ.MITE_UOPS@ / (cpu_core@IDQ.DSB_UOPS@ + cpu_core@IDQ.MITE_UOPS@) * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE@ - cpu_core@IDQ_BUBBLES.FETCH_LATENCY@)) / tma_info_thread_clks",
+ "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / 2 + cpu_core@IDQ.MITE_UOPS@ / (cpu_core@IDQ.DSB_UOPS@ + cpu_core@IDQ.MITE_UOPS@) * (cpu_core@IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE@ - cpu_core@IDQ_BUBBLES.FETCH_LATENCY@)) / tma_info_thread_clks",
  "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
  "MetricName": "tma_mite",
  "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
@@ -2392,17 +2391,17 @@
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
+ "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles)",
  "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thread_clks",
  "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
  "MetricName": "tma_mixing_vectors",
  "MetricThreshold": "tma_mixing_vectors > 0.05",
- "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating.
Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+ "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+ "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details.",
  "MetricExpr": "cpu_core@IDQ.MS_CYCLES_ANY@ / tma_info_thread_clks",
  "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
  "MetricName": "tma_ms",
@@ -2415,7 +2414,7 @@
  "MetricExpr": "3 * cpu_core@IDQ.MS_SWITCHES@ / tma_info_thread_clks",
  "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
  "MetricName": "tma_ms_switches",
- "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+ "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
  "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream).
Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
@@ -2426,7 +2425,7 @@
  "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
  "MetricName": "tma_non_fused_branches",
  "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
- "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
+ "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2435,7 +2434,7 @@
  "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ / (tma_retiring * tma_info_thread_slots)",
  "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
  "MetricName": "tma_nop_instructions",
- "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+ "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
  "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g.
start address of a function or loop body. Sample with: INST_RETIRED.NOP",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
@@ -2451,20 +2450,20 @@
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+ "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).",
  "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE_CLEARS.COUNT@)), 0.0001)",
  "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
  "MetricName": "tma_other_mispredicts",
- "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+ "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+ "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.",
  "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEARS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)",
  "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
  "MetricName": "tma_other_nukes",
- "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+ "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2474,7 +2473,7 @@
  "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
  "MetricName": "tma_page_faults",
  "MetricThreshold": "tma_page_faults > 0.05",
- "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+ "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2483,8 +2482,8 @@
  "MetricExpr": "((cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_3_PORTS_UTIL@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_3_PORTS_UTIL@) / tma_info_thread_clks)",
  "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
  "MetricName": "tma_ports_utilization",
- "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
- "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider.
For example; when there are too many multiply operations",
+ "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+ "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2493,8 +2492,8 @@
  "MetricExpr": "cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ / tma_info_thread_clks",
  "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
  "MetricName": "tma_ports_utilized_0",
- "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
- "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+ "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+ "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise).
Long-latency instructions like divides may contribute to this metric.",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2503,7 +2502,7 @@
  "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thread_clks",
  "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
  "MetricName": "tma_ports_utilized_1",
- "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
  "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
@@ -2514,8 +2513,8 @@
  "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thread_clks",
  "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
  "MetricName": "tma_ports_utilized_2",
- "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
- "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).
Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+ "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+ "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
@@ -2525,24 +2524,24 @@
  "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thread_clks",
  "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
  "MetricName": "tma_ports_utilized_3m",
- "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
  "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).
Sample with: UOPS_EXECUTED.CYCLES_GE_3",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
- "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions",
- "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@br_misp_retired.ret_cost@R / tma_info_thread_clks",
+ "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions.",
+ "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@BR_MISP_RETIRED.RET_COST@R / tma_info_thread_clks",
  "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
  "MetricName": "tma_ret_mispredicts",
- "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+ "MetricThreshold": "tma_ret_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
  "ScaleUnit": "100%",
  "Unit": "cpu_core"
  },
  {
  "BriefDescription": "This category represents fraction of slots utilized by useful work i.e.
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2556,8 +2555,8 @@ "MetricExpr": "(cpu_core@BE_STALLS.SCOREBOARD@ + cpu_core@CPU_CLK_= UNHALTED.C02@) / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. 
Related me= trics: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2566,8 +2565,8 @@ "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2577,28 +2576,28 @@ "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * cpu_c= ore@mem_inst_retired.split_loads@R, cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ = * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@mem_inst_retired.= split_loads@R else cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * tma_info_memory= _load_miss_real_latency) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * min(cpu_co= re@MEM_INST_RETIRED.SPLIT_LOADS@R, tma_info_memory_load_miss_real_latency) = / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * cpu_= core@mem_inst_retired.split_stores@R, cpu_core@MEM_INST_RETIRED.SPLIT_STORE= S@) if 0 < cpu_core@mem_inst_retired.split_stores@R else cpu_core@MEM_INST_= RETIRED.SPLIT_STORES@) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * min(cpu_c= ore@MEM_INST_RETIRED.SPLIT_STORES@R, 1) / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2607,8 +2606,8 @@ "MetricExpr": "(cpu_core@XQ.FULL@ + cpu_core@L1D_MISS.L2_STALLS@) = / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2617,8 +2616,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2627,8 +2626,8 @@ "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2637,8 +2636,8 @@ "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2648,7 +2647,6 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", "MetricThreshold": "tma_store_op_utilization > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2657,7 +2655,7 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2666,34 +2664,34 @@ "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= thread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": 
"tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_1G@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_2M_4M@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_= core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.W= ALK_COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory 
paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_4K@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2702,7 +2700,7 @@ "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_i= nfo_thread_clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. 
Related metrics: tma= _fb_full", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2712,7 +2710,7 @@ "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info= _thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2722,8 +2720,8 @@ "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_co= re@UOPS_EXECUTED.THREAD@", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%", "Unit": "cpu_core" } diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json b/tools/p= erf/pmu-events/arch/x86/lunarlake/memory.json index 60daff922a89..589bd79fe069 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/memory.json @@ -314,6 +314,17 @@ "UMask": "0x4", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1FBC000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache and were su= pplied by the system memory (DRAM, MSC, or MMIO).", "Counter": "0,1,2,3,4,5,6,7", @@ -325,6 +336,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1FBC000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1E780000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache and were supplied by the system memory (DRAM, MSC, or MM= IO).", "Counter": "0,1,2,3,4,5,6,7", @@ -347,6 +380,17 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand read for ownership (RFO) 
reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1FBC000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache and were supplied by the system memory (DRAM, MS= C, or MMIO).", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/other.json b/tools/pe= rf/pmu-events/arch/x86/lunarlake/other.json index 667707d4fe37..ad646db01f8c 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/other.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/other.json @@ -18,15 +18,6 @@ "UMask": "0x8", "Unit": "cpu_core" }, - { - "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa2", - "EventName": "BE_STALLS.SCOREBOARD", - "SampleAfterValue": "100003", - "UMask": "0x2", - "Unit": "cpu_core" - }, { "BriefDescription": "Counts the number of unhalted cycles a Core i= s blocked due to a lock In Progress issued by another core", "Counter": "0,1,2,3,4,5,6,7", @@ -65,15 +56,6 @@ "UMask": "0x8", "Unit": "cpu_atom" }, - { - "BriefDescription": "Count number of times a load is depending on = another load that had just write back its data or in previous or 2 cycles = back. 
This event supports in-direct dependency through a single uop.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x02", - "EventName": "DEPENDENT_LOADS.ANY", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, { "BriefDescription": "Counts the number of cycles the L2 Prefetcher= s are at throttle level 0", "Counter": "0,1,2,3,4,5,6,7", @@ -119,170 +101,6 @@ "UMask": "0x10", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts the number of uops executed on all Int= eger ports.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.ALL", - "SampleAfterValue": "1000003", - "UMask": "0xff", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on a load = port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.LD", - "PublicDescription": "Counts the number of uops executed on a load= port. This event counts for integer uops even if the destination is FP/ve= ctor", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 0.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.P0", - "SampleAfterValue": "1000003", - "UMask": "0x8", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 1.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.P1", - "SampleAfterValue": "1000003", - "UMask": "0x10", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 2.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.P2", - "SampleAfterValue": "1000003", - "UMask": "0x20", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 3.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", 
- "EventName": "INT_UOPS_EXECUTED.P3", - "SampleAfterValue": "1000003", - "UMask": "0x40", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on integer= port 0,1, 2, 3.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.PRIMARY", - "SampleAfterValue": "1000003", - "UMask": "0x78", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on a Store= address port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.STA", - "PublicDescription": "Counts the number of uops executed on a Stor= e address port. This event counts integer uops even if the data source is F= P/vector", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb3", - "EventName": "INT_UOPS_EXECUTED.STD_JMP", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of LLC prefetches that were= throttled due to Dynamic Prefetch Throttling. The throttle requestor/sou= rce could be from the uncore/SOC or the Dead Block Predictor. Counts on a p= er core basis.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x29", - "EventName": "LLC_PREFETCHES_THROTTLED.DPT", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of LLC prefetches throttled= due to Demand Throttle Prefetcher. DTP Global Triggered with no Local Ove= rride. Counts on a per core basis.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x29", - "EventName": "LLC_PREFETCHES_THROTTLED.DTP", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of LLC prefetches not throt= tled by DTP due to local override. 
These prefetches may still be throttled= due to another throttler mechanism. Counts on a per core basis.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x29", - "EventName": "LLC_PREFETCHES_THROTTLED.DTP_OVERRIDE", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of LLC prefetches throttled= due to LLC hit rate in . Counts on a per core basis= .", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x29", - "EventName": "LLC_PREFETCHES_THROTTLED.HIT_RATE", - "SampleAfterValue": "1000003", - "UMask": "0x10", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of LLC prefetches throttled= due to exceeding the XQ threshold set by either XQ_THRESOLD_DTP or LLC_XQ_= THRESHOLD. Counts on a per core basis.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x29", - "EventName": "LLC_PREFETCHES_THROTTLED.XQ_THRESH", - "SampleAfterValue": "1000003", - "UMask": "0x8", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.L1", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.L2", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": 
"MEMORY_STALLS.L3", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x46", - "EventName": "MEMORY_STALLS.MEM", - "SampleAfterValue": "1000003", - "UMask": "0x8", - "Unit": "cpu_core" - }, { "BriefDescription": "Counts all requests that have any type of res= ponse.", "Counter": "0,1,2,3,4,5,6,7", @@ -294,127 +112,6 @@ "UMask": "0x1", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts writebacks of modified cachelines that= have any type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.COREWB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10008", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts writebacks of non-modified cachelines = that have any type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.COREWB_NONM.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x11000", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1FBC000004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": 
"cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1FBC000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1E780000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - 
"MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1FBC000002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, { "BriefDescription": "Counts full streaming stores (64 bytes, WCiLF= ) that have any type of response.", "Counter": "0,1,2,3,4,5,6,7", @@ -459,56 +156,6 @@ "UMask": "0x1", "Unit": "cpu_core" }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "1000003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number issue slots not consumed d= ue to a color request for an FCW or MXCSR control register when all 4 colo= rs (copies) are already in use", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.COLOR_STALLS", - "SampleAfterValue": "1000003", - "UMask": "0x8", - "Unit": "cpu_atom" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3,4,5,6,7,8,9", diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json b/tools= /perf/pmu-events/arch/x86/lunarlake/pipeline.json index f4ec7a884937..c6f41f5a5d62 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json @@ -87,6 +87,15 @@ "UMask": "0x1f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa2", + "EventName": "BE_STALLS.SCOREBOARD", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the total number of branch 
instruction= s retired for all branch types.", "Counter": "0,1,2,3,4,5,6,7", @@ -757,6 +766,15 @@ "UMask": "0x4", "Unit": "cpu_core" }, + { + "BriefDescription": "Count number of times a load is depending on = another load that had just write back its data or in previous or 2 cycles = back. This event supports in-direct dependency through a single uop.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x02", + "EventName": "DEPENDENT_LOADS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, { "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -992,6 +1010,89 @@ "UMask": "0x10", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of uops executed on all Int= eger ports.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a load = port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.LD", + "PublicDescription": "Counts the number of uops executed on a load= port. 
This event counts for integer uops even if the destination is FP/ve= ctor", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 0.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P0", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 1.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P1", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 2.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P2", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P3", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 0,1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a Store= address port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STA", + "PublicDescription": "Counts the number of uops executed on a Stor= e address port. 
This event counts integer uops even if the data source is F= P/vector", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STD_JMP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "Number of vector integer instructions retired= of 128-bit vector-width.", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1267,6 +1368,42 @@ "UMask": "0x4", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L1", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L2", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L3", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": 
"MEMORY_STALLS.MEM", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, { "BriefDescription": "LFENCE instructions retired", "Counter": "0,1,2,3,4,5,6,7,8,9", @@ -1393,6 +1530,56 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number issue slots not consumed d= ue to a color request for an FCW or MXCSR control register when all 4 colo= rs (copies) are already in use", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.COLOR_STALLS", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the = front-end from issuing from the UROM until a specified older uop retires.", "Counter": "0,1,2,3,4,5,6,7", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9DE8F1EE7B1 for ; Sat, 22 Mar 2025 06:35:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625338; cv=none; 
Date: Fri, 21 Mar 2025 23:33:50 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-23-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 22/35] perf vendor events: Update meteorlake events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander
Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "Andreas Färber" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.12 to v1.13. Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../pmu-events/arch/x86/meteorlake/cache.json | 179 ++++++ .../arch/x86/meteorlake/memory.json | 44 ++ .../arch/x86/meteorlake/mtl-metrics.json | 549 +++++++++--------- .../pmu-events/arch/x86/meteorlake/other.json | 140 ----- .../arch/x86/meteorlake/pipeline.json | 44 +- .../arch/x86/meteorlake/uncore-memory.json | 18 + 7 files changed, 559 insertions(+), 417 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 579b4fbd65d6..0c16c9b840a5 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -23,7 +23,7 @@ GenuineIntel-6-3E,v24,ivytown,core GenuineIntel-6-2D,v24,jaketown,core GenuineIntel-6-(57|85),v16,knightslanding,core GenuineIntel-6-BD,v1.11,lunarlake,core -GenuineIntel-6-(AA|AC|B5),v1.12,meteorlake,core +GenuineIntel-6-(AA|AC|B5),v1.13,meteorlake,core GenuineIntel-6-1[AEF],v4,nehalemep,core GenuineIntel-6-2E,v4,nehalemex,core GenuineIntel-6-A7,v1.04,rocketlake,core diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json index ce351cd7caaf..7f455864b1a7 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json @@ -1,4 +1,14 @@ [ + { + "BriefDescription": "Counts the number of L1D cacheline (dirty) evictions caused by load misses,
stores, and prefetches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x51", + "EventName": "DL1.DIRTY_EVICTION", + "PublicDescription": "Counts the number of L1D cacheline (dirty) e= victions caused by load misses, stores, and prefetches. Does not count evi= ctions or dirty writebacks caused by snoops. Does not count a replacement = unless a (dirty) line was written back.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "L1D.HWPF_MISS", "Counter": "0,1,2,3", @@ -81,6 +91,56 @@ "UMask": "0x1f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Exclusive state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.E", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Exclusive state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Forward state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.F", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Forward state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Modified state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.M", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Modified state. 
Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Shared state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.S", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Shared state. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = evicted due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= evicted due to an L2 cache fill. Increments on the core that brought the l= ine in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, { "BriefDescription": "Modified cache lines that are evicted by L2 c= ache when triggered by an L2 cache fill.", "Counter": "0,1,2,3", @@ -91,6 +151,16 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = silently dropped due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= silently dropped due to an L2 cache fill. 
Increments on the core that bro= ught the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3", @@ -121,6 +191,15 @@ "UMask": "0xff", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that r= esulted in a Hit from a front door request only (does not include rejects o= r recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, { "BriefDescription": "All requests that hit L2 cache. [This event i= s alias to L2_RQSTS.HIT]", "Counter": "0,1,2,3", @@ -131,6 +210,15 @@ "UMask": "0xdf", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of total L2 Cache Accesses = that resulted in a Miss from a front door request only (does not include re= jects or recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_RQSTS.MISS]", "Counter": "0,1,2,3", @@ -141,6 +229,15 @@ "UMask": "0x3f", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that m= iss the L2 and get BBL reject short and long rejects (includes those count= ed in L2_reject_XQ.any), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.REJECTS", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "L2 code requests", "Counter": "0,1,2,3", @@ -398,6 +495,15 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which missed in the 
L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an ICACHE or ITLB miss which hit in the LLC. If the= core has access to an L3 cache, an LLC hit refers to an L3 cache hit, othe= rwise it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", @@ -435,6 +541,15 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC. If the cor= e has access to an L3 cache, an LLC hit refers to an L3 cache hit, otherwis= e it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", @@ -453,6 +568,15 @@ "UMask": "0x78", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled to a store buffer full condition", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.SBFULL", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, { "BriefDescription": "Retired load instructions.", "Counter": "0,1,2,3", @@ -1054,6 +1178,17 @@ "UMask": "0x3", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { 
"BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", @@ -1098,6 +1233,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", @@ -1164,6 +1321,28 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 
cache.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json b/tools/= perf/pmu-events/arch/x86/meteorlake/memory.json index e4481fbc1e13..8f07575da9f0 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/memory.json @@ -294,6 +294,17 @@ "UMask": "0x4", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", @@ -305,6 +316,28 @@ "UMask": "0x1", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", @@ -327,6 +360,17 @@ "UMask": "0x1", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + 
"EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json b/t= ools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json index 20c52630127e..89111cfcf3ae 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json @@ -75,7 +75,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -85,7 +85,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "(tma_backend_bound >0.10)", + "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. 
If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%", @@ -97,7 +97,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL_P@ / (6 * cpu_= atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "(tma_bad_speculation >0.15)", + "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", "ScaleUnit": "100%", @@ -108,7 +108,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -118,7 +118,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", + "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -128,7 +128,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -137,7 +137,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (6 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -146,7 +146,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", 
"ScaleUnit": "100%", "Unit": "cpu_atom" @@ -156,7 +156,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -165,7 +165,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (6 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -175,7 +175,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "(tma_frontend_bound >0.20)", + "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -185,7 +185,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -194,7 +194,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -204,7 +204,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (6 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -564,7 +564,7 @@ "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@ / cpu_atom@CPU_CL= K_UNHALTED.CORE@", "MetricName": "tma_info_system_mux", - "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", "Unit": "cpu_atom" }, { @@ -603,7 +603,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / (6 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -612,7 +612,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / = (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", + 
"MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -622,7 +622,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -631,7 +631,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -640,7 +640,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (6 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -649,7 +649,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & 
tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -658,7 +658,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (6 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -667,7 +667,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (6 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -676,7 +676,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -685,7 +685,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -696,7 +696,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / (6 * cpu_atom@CP= 
U_CLK_UNHALTED.CORE@)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "(tma_retiring >0.75)", + "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%", "Unit": "cpu_atom" @@ -706,7 +706,7 @@ "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -718,7 +718,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_D= ISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DIS= PATCHED.PORT_6@) / (5 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -731,13 +731,13 @@ "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. 
Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_threa= d_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -748,7 +748,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -765,13 +765,13 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", 
"MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20", @@ -788,16 +788,16 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= 
_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + 
tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependen= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_= bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma= _l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb= _full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tm= a_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_= latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_s= tore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_mem= ory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store= + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming= _stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", 
"MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -806,16 +806,16 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_= clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_br= anch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_u= nknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_reste= ers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth *= tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_co= de", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * t= ma_ms / (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code= ", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": 
"tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", @@ -823,7 +823,7 @@ }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_s= witches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_rest= eers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_r= esteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses= + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_= dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ls= d + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / = (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_mic= rocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite = + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_bra= nch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_n= ukes / tma_other_nukes + tma_core_bound * 
(tma_serializing_operation + cpu_= core@RS.EMPTY_RESOURCE@ / tma_info_thread_clks * tma_ports_utilized_0) / (t= ma_divider + tma_ports_utilization + tma_serializing_operation) + tma_micro= code_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (t= ma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -832,7 +832,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound= * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dra= m_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_fa= lse_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + = tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound= * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false= _sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -841,16 +841,16 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency 
updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache", "Unit": "cpu_core" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -863,11 +863,11 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", "Unit": "cpu_core" }, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_i= nstructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -876,7 +876,7 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@top= down\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-re= tiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -890,26 +890,26 @@ "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_= thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric 
represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_cl= ks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -918,7 +918,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": 
"tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -928,90 +928,90 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@frontend_retired.l1i_miss@R / tma_info_thread_clks - tma_code_l2_miss)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.L1I_MISS@ * cpu_co= re@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@front= end_retired.l2_miss@R / tma_info_thread_clks", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.L2_MISS@ * cpu_core@FRONT= END_RETIRED.L2_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & 
(tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", - "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@frontend_retired.itlb_miss@R / tma_info_thread_clks - tma_code_stlb_mis= s)", + "MetricExpr": "max(0, cpu_core@FRONTEND_RETIRED.ITLB_MISS@ * cpu_c= ore@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_mis= s)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@fro= ntend_retired.stlb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.STLB_MISS@ * cpu_core@FRO= NTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging 
structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@ / (cpu_core@ITLB_MISSES.= WALK_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "cpu_core@ITLB_MISSES.WALK_ACTIVE@ / tma_info_thread= _clks * cpu_core@ITLB_MISSES.WALK_COMPLETED_4K@ / (cpu_core@ITLB_MISSES.WAL= K_COMPLETED_4K@ + cpu_core@ITLB_MISSES.WALK_COMPLETED_2M_4M@)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound 
> 0.15)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@br_misp_retired.cond_ntaken_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@ * cpu_co= re@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_nt_mispredicts", - "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_COST@ * cpu_cor= e@br_misp_retired.cond_taken_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.COND_TAKEN_COST@ * cpu_cor= e@BR_MISP_RETIRED.COND_TAKEN_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_cond_tk_mispredicts", - "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & (tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { 
"BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ *= cpu_core@mem_load_l3_hit_retired.xsnp_miss@R, cpu_core@MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS@ * (27 * tma_info_system_core_frequency) - 3 * tma_info_syst= em_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_miss@R else= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * (27 * tma_info_system_core_f= requency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@MEM_LOAD_L3= _HIT_RETIRED.XSNP_FWD@ * cpu_core@mem_load_l3_hit_retired.xsnp_fwd@R, cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (28 * tma_info_system_core_frequenc= y) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit_re= tired.xsnp_fwd@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (28 * tm= a_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (cpu_= core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L= 3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@)= )) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_= MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_fre= quency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOA= D_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency) * (cpu_co= re@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_= HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@)))= * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MI= SS@ / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": 
"tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1022,26 +1022,26 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "((min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@= * cpu_core@mem_load_l3_hit_retired.xsnp_no_fwd@R, cpu_core@MEM_LOAD_L3_HIT= _RETIRED.XSNP_NO_FWD@ * (27 * tma_info_system_core_frequency) - 3 * tma_inf= o_system_core_frequency) if 0 < cpu_core@mem_load_l3_hit_retired.xsnp_no_fw= d@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * (27 * tma_info_sys= tem_core_frequency) - 3 * tma_info_system_core_frequency) + (min(cpu_core@M= EM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * cpu_core@mem_load_l3_hit_retired.xsnp_fw= d@R, cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (27 * tma_info_system_cor= e_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@mem_load= _l3_hit_retired.xsnp_fwd@R else cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ = * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequenc= y) * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DE= MAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_= HIT_WITH_FWD@))) * (1 + 
cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LO= AD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * mi= n(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core= _frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 24 * tma_info_system_core_frequency) * (1 = - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DAT= A_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH= _FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIR= ED.L1_MISS@ / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - = cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks = / 2", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cp= u_core@INST_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1051,7 +1051,7 @@ "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1061,7 +1061,7 @@ "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info= _thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -1072,7 +1072,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1081,28 +1081,28 @@ "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * cp= u_core@mem_inst_retired.stlb_hit_loads@R, cpu_core@MEM_INST_RETIRED.STLB_HI= T_LOADS@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_loads@R else cpu_co= re@MEM_INST_RETIRED.STLB_HIT_LOADS@ * 7) / tma_info_thread_clks + tma_load_= stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * min(cpu= _core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, 7) / tma_info_thread_clks + tma_lo= ad_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * c= pu_core@mem_inst_retired.stlb_hit_stores@R, cpu_core@MEM_INST_RETIRED.STLB_= HIT_STORES@ * 7) if 0 < cpu_core@mem_inst_retired.stlb_hit_stores@R else cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * 7) / tma_info_thread_clks + tma_= store_stlb_miss", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * min(cp= u_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, 7) / tma_info_thread_clks + tma_= store_stlb_miss", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1111,8 +1111,8 @@ "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1122,7 +1122,7 @@ "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1133,18 +1133,18 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - cpu_core@I= NT_MISC.UOP_DROPPING@ / tma_info_thread_slots", + "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1164,7 +1164,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1174,16 +1174,16 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "cpu_core@ARITH.FPDIV_ACTIVE@ / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1192,8 +1192,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1202,8 +1202,8 @@ "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1212,8 +1212,8 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1222,41 +1222,41 @@ "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - cpu_core@IN= T_MISC.UOP_DROPPING@ / tma_info_thread_slots", + "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+]). Sample with: UOPS_RE= TIRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1265,26 +1265,26 @@ "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@br_misp_retired.indirect_call_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= .", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_call_mispredicts", - "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & 
(tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", - "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@br_misp_retired.indirect_cost@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@br_misp_retired.indirect_call_cost@R) / tma_info_thread= _clks, 0)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions.= ", + "MetricExpr": "max((cpu_core@BR_MISP_RETIRED.INDIRECT_COST@ * cpu_= core@BR_MISP_RETIRED.INDIRECT_COST@R - cpu_core@BR_MISP_RETIRED.INDIRECT_CA= LL_COST@ * cpu_core@BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread= _clks, 0)", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ind_jump_mispredicts", - "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & (tma_branch_= mispredicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -1297,7 +1297,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", @@ -1305,7 +1305,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + 
"BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", @@ -1313,15 +1313,15 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -1353,7 +1353,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": 
"tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -1362,7 +1362,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1371,10 +1371,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: ", "Unit": "cpu_core" }, { @@ -1445,12 +1446,12 @@ "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" @@ -1465,15 +1466,15 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= 
re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= frontend_retired.any_dsb_miss@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@ * cpu_core@= FRONTEND_RETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", "MetricGroup": "DSBmiss;Fed;FetchLat", "MetricName": "tma_info_frontend_dsb_switches_ret", "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", @@ -1481,7 +1482,7 @@ }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" @@ -1531,7 +1532,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@fron= tend_retired.ms_flows@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.MS_FLOWS@ * cpu_core@FRON= TEND_RETIRED.MS_FLOWS@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat;MicroSeq", "MetricName": "tma_info_frontend_ms_latency_ret", "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", @@ -1546,21 +1547,21 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": 
"cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", - "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@frontend_retired.unknown_branch@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@FRONTEND_RETIRED.UNKNOWN_BRANCH@ * cpu_cor= e@FRONTEND_RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", "MetricGroup": "Fed;FetchLat", "MetricName": "tma_info_frontend_unknown_branches_ret", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", @@ -1580,7 +1581,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW.", "Unit": "cpu_core" }, { @@ -1589,7 +1590,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1598,7 +1599,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1607,7 +1608,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1616,7 +1617,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", "Unit": "cpu_core" }, { @@ -1679,7 +1680,7 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 13", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1825,7 +1826,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" @@ -1883,7 +1884,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@mem_inst_retired.stlb_miss_loads@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@ * cpu_co= re@MEM_INST_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", @@ -1906,7 +1907,7 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@mem_inst_retired.stlb_miss_stores@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@ * cpu_c= ore@MEM_INST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", @@ -1921,7 +1922,7 @@ }, { "BriefDescription": "", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= 
CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D0x1@)", + "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute", "Unit": "cpu_core" @@ -1952,20 +1953,20 @@ "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -2018,23 +2019,22 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, 
exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / BR_INST_RETIRED.FAR_BR= ANCH:u", + "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / cpu_core@CPU_CLK_UNHA= LTED.THREAD@", + "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -2042,7 +2042,7 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @thresh\\=3D0x1@", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches", @@ -2093,7 +2093,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", @@ -2104,7 +2104,6 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { @@ -2112,7 +2111,7 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", "Unit": "cpu_core" }, { @@ -2124,14 +2123,14 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization", "Unit": "cpu_core" @@ -2149,15 +2148,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "MetricThreshold": "tma_info_thread_uptb < 9", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2167,7 +2166,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric 
represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2176,8 +2175,8 @@ "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_V= EC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2186,8 +2185,8 @@ "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_V= EC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring *= tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_por= t_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2196,8 +2195,8 @@ "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2206,7 +2205,7 @@ "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. 
Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2216,7 +2215,7 @@ "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu= _core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 /= 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVIT= Y.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2226,17 +2225,17 @@ "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * cpu_core@m= em_load_retired.l2_hit@R, cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info= _system_core_frequency)) if 0 < cpu_core@mem_load_retired.l2_hit@R else cpu= _core@MEM_LOAD_RETIRED.L2_HIT@ * (3 * tma_info_system_core_frequency)) * (1= + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ /= 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L2_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L2_HIT@R, 3 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & 
(tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2246,18 +2245,18 @@ "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(min(cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * cpu_core@m= em_load_retired.l3_hit@R, cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 * tma_inf= o_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_c= ore@mem_load_retired.l3_hit@R else cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * (12 = * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (= 1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ = / 2) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L3_HIT@R, 9 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. 
Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2266,19 +2265,19 @@ "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2297,7 +2296,7 @@ "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2306,43 +2305,43 @@ "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_1G@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_2M_4M@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_C= OMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss 
* cpu_core@DTLB_LOAD_MISSES.WALK= _COMPLETED_4K@ / (cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED_4K@ + cpu_core@D= TLB_LOAD_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_LOAD_MISSES.WALK_COMP= LETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@me= m_inst_retired.lock_loads@R / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@ME= M_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2353,7 +2352,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2364,17 +2363,17 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2383,34 +2382,34 @@ "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -2433,7 +2432,7 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": 
"tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2449,18 +2448,18 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "max(cpu_core@IDQ.MS_CYCLES_ANY@, cpu_core@UOPS_RETI= RED.MS\\,cmask\\=3D0x1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISS= UED.ANY@)) / tma_info_core_core_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "max(cpu_core@IDQ.MS_CYCLES_ANY@, cpu_core@UOPS_RETI= RED.MS\\,cmask\\=3D1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUE= D.ANY@)) / tma_info_core_core_clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -2469,10 +2468,10 @@ }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge= \\=3D0x1@ / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tm= a_info_thread_clks", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due 
to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2483,7 +2482,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2492,7 +2491,7 @@ "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2508,20 +2507,20 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= 
", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2531,7 +2530,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2541,7 +2540,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2551,7 +2550,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2561,7 +2560,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2570,8 +2569,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= 3_PORTS_UTIL@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_ACTIVE@ < cpu_= core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ e= lse (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTI= VITY.2_3_PORTS_UTIL@) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2580,8 +2579,8 @@ "MetricExpr": "max(cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ - cpu_= core@RESOURCE_STALLS.SCOREBOARD@, 0) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2590,7 +2589,7 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2601,8 +2600,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2612,24 +2611,24 @@ "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", - "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@br_mi= sp_retired.ret_cost@R / tma_info_thread_clks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s.", + "MetricExpr": "cpu_core@BR_MISP_RETIRED.RET_COST@ * cpu_core@BR_MI= SP_RETIRED.RET_COST@R / tma_info_thread_clks", "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", "MetricName": "tma_ret_mispredicts", - "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & (tma_branch_mispr= edicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2643,7 +2642,7 @@ "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2653,8 +2652,8 @@ "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2664,28 +2663,28 @@ "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * cpu_c= ore@mem_inst_retired.split_loads@R, cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ = * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@mem_inst_retired.= split_loads@R else cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * tma_info_memory= _load_miss_real_latency) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * min(cpu_co= re@MEM_INST_RETIRED.SPLIT_LOADS@R, tma_info_memory_load_miss_real_latency) = / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "(min(cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * cpu_= core@mem_inst_retired.split_stores@R, cpu_core@MEM_INST_RETIRED.SPLIT_STORE= S@) if 0 < cpu_core@mem_inst_retired.split_stores@R else cpu_core@MEM_INST_= RETIRED.SPLIT_STORES@) / tma_info_thread_clks", + "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * min(cpu_c= ore@MEM_INST_RETIRED.SPLIT_STORES@R, 1) / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2694,8 +2693,8 @@ "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.= L2_STALLS@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2704,8 +2703,8 @@ "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2714,8 +2713,8 @@ "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2724,8 +2723,8 @@ "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2744,7 +2743,7 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2753,34 +2752,34 @@ "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= core_core_clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_1G@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & 
tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_2M_4M@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_= core@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.W= ALK_COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * cpu_core@DTLB_STORE_MISSES.WA= LK_COMPLETED_4K@ / (cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED_4K@ + cpu_cor= 
e@DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M@ + cpu_core@DTLB_STORE_MISSES.WALK= _COMPLETED_1G@)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2789,7 +2788,7 @@ "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_i= nfo_thread_clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. 
Related metrics: tma= _fb_full", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2799,7 +2798,7 @@ "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info= _thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2809,8 +2808,8 @@ "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_co= re@UOPS_EXECUTED.THREAD@", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%", "Unit": "cpu_core" } diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/other.json b/tools/p= erf/pmu-events/arch/x86/meteorlake/other.json index 46a21776a4e9..4d64bedb3e8c 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/other.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/other.json @@ -28,105 +28,6 @@ "UMask": "0x1", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - 
"MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1", - "Unit": "cpu_atom" - }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", "Counter": "0,1,2,3,4,5,6,7", @@ -171,47 +72,6 @@ "UMask": "0x1", "Unit": "cpu_core" }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - 
"PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4", - "Unit": "cpu_atom" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json b/tool= s/perf/pmu-events/arch/x86/meteorlake/pipeline.json index 265f6c5a0248..e98a0324a6a6 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json @@ -1133,8 +1133,9 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears 
that flus= h the pipeline and restart the machine with the use of microcode due to SMC= , MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_= TRAP.", + "BriefDescription": "This event is deprecated.", "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", "EventCode": "0xc3", "EventName": "MACHINE_CLEARS.SLOW", "SampleAfterValue": "20003", @@ -1208,6 +1209,47 @@ "UMask": "0x2", "Unit": "cpu_core" }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, { "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/meteorlake/uncore-memory.json index 783a4f7fd05b..ceb8839f0767 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/uncore-memory.json @@ -99,6 +99,24 @@ "PerPkg": "1", "Unit": "iMC" }, + { + "BriefDescription": "Any Rank at Hot state", + "Counter": "0,1,2,3,4", + "EventCode": "0x19", + "EventName": "UNC_M_DRAM_THERMAL_HOT", + "Experimental": "1", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Any Rank at Warm state", + "Counter": "0,1,2,3,4", + "EventCode": "0x1A", + "EventName": "UNC_M_DRAM_THERMAL_WARM", + "Experimental": "1", + "PerPkg": "1", + "Unit": "iMC" + }, { "BriefDescription": "PRE command sent to DRAM due to page table id= le timer expiration", "Counter": 
"0,1,2,3,4", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AC5BE1EEA3C for ; Sat, 22 Mar 2025 06:35:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625325; cv=none; b=NLiYYWj6GDZymghb/zuaoaekUX90YtMPx+l6leE8ji2ISQ1SOkTHbdGqF0ST6vdwTuf/E9kevHxFq6MwOdkE+AiWBWrioN0r4ghYsFp6lo4NAzg/3Tfm+uGpQXVEdJBKdMzZEcHeLZQOPGkMLSjin0rR6ilhmRLW3LGuUKjIq1A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625325; c=relaxed/simple; bh=8ECLI9MH8G6GS+TQuT+0Q+DWep0ey0sdDAegVmCJBw4=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=U6fdMNpVMdd/2ru4hG8iOKPzD+rY9zrxFWMGNho8xDr0jt0DLIU5TbHI0P9QNIaCbsGTbUI4WeoVsz8dgKp95+cxbkJskULbNhHkP5iNLFsKweLG3JSvTbev8ZjCzQbbqlCBm0Lg742+TBeSexMb03M69eunTLI9Ihi+ssjj8XI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=hL2ukoSA; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="hL2ukoSA" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6fec94421f8so37863367b3.0 for ; Fri, 21 Mar 2025 23:35:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=google.com; s=20230601; t=1742625320; x=1743230120; darn=vger.kernel.org; h=to:from:subject:references:mime-version:message-id:in-reply-to:date :from:to:cc:subject:date:message-id:reply-to; bh=NTidQqUKCPDZIA1M8BUtNh4exn5q3fZouzrUCwjO3OM=; b=hL2ukoSAbFoh0T1H3pNTUBluThMNL43CyR6DNi9TcEYkcphKUkJqlF4DPUsgIkrbbK q0/cpLmM/dFFYllfP/W15vAZmGLiDYoetatrtDywvJvYl6bDlBXt0xxUwToK6h+/fRuz rCDADDEll/UhTPJXhUeQIyAXq0WjoOzriVWbyZvqfTXzSkvu3uvgRRzOE+9/J9MPPL0o Abt8uML9n+dVE3da/r5aGSFJXBmso3k3GQV/bOeCrtM50FyD6CCzFEF3CZC6XS2nSaAn tl6BKg/slsstEKcYKqnygiO+jbg3V5MGRY/0gGd22MEvTi4l5irOJlwPi1faCrxpT7yq DDAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625320; x=1743230120; h=to:from:subject:references:mime-version:message-id:in-reply-to:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=NTidQqUKCPDZIA1M8BUtNh4exn5q3fZouzrUCwjO3OM=; b=USsONJoc5jvSsRZqit16Qz7yVMsB3u4gTgo7DcCWvB4fvPmfkwExJsGM3hsSQ7GVQ9 fcp/ztmljTTMoezf51aONb8oERZ25ZfUwxa+VCcS2R5X1iMS+Yxp0/OAYVzz8eA47pMg mr6Mye8EvdjFHiFpEtDyJTcZ1ZLEfbXAL3JxgUgZzQN3eoa0bKJofJg93pScnpwyHkTo BvoRhWCGLXcYiY8Mv9iqQxNFeZdccaI8t1FR2U87MJ674XtxXuQ5KIWXZ3KplHSzdG6o BHr3TSh8yuQioofvK2m7f5IMt0ztheJ81maqPe2pb2RvXgW5vMSoDkLMtEWLsmplZZZ0 p1rw== X-Forwarded-Encrypted: i=1; AJvYcCVy/zen5r9ndgwdUlyRle9KguMXkD2/9goZtrbudCISFyU/F2T57Nsk2sw/l+RKBGsCNRAO48QVE9yen3g=@vger.kernel.org X-Gm-Message-State: AOJu0YxdrqIptxuC6JZHyiHHgSaGkYDAyJkGYArjJx95BgCZT/ND/A7n j9RGglMU2f+MhdPoilTXUX6rkSnk64xvbQFLZGy+0Es82BzTP4tKHFwQUQPmQ66TM0Z4o/gYB5d 5qzMe+w== X-Google-Smtp-Source: AGHT+IHEsM8kxSNExm7drga3vJyZ45993Hm8XS45hdWCxlaPPb50fmcZUr1lIFkO77tC66DDN5nXw8NchStZ X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:690c:2f0e:b0:6ff:1fac:c4ec with SMTP id 00721157ae682-700bad55687mr298257b3.7.1742625320483; Fri, 21 Mar 2025 23:35:20 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:51 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> 
Message-Id: <20250322063403.364981-24-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 23/35] perf vendor events: Update nehalemep events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topic moving other topic events to cache and virtual memory. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/nehalemep/cache.json | 32 +++++++++++++++ .../pmu-events/arch/x86/nehalemep/other.json | 40 ------------------- .../arch/x86/nehalemep/virtual-memory.json | 8 ++++ 3 files changed, 40 insertions(+), 40 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/nehalemep/cache.json b/tools/pe= rf/pmu-events/arch/x86/nehalemep/cache.json index b90026df2ce7..c9d154f1d09a 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemep/cache.json +++ b/tools/perf/pmu-events/arch/x86/nehalemep/cache.json @@ -239,6 +239,38 @@ "SampleAfterValue": "100000", "UMask": "0x2" }, + { + "BriefDescription": "L1I instruction fetch stall cycles", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.CYCLES_STALLED", + "SampleAfterValue": "2000000", + "UMask": "0x4" + }, + { + "BriefDescription": "L1I instruction fetch hits", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.HITS", + "SampleAfterValue": "2000000", + "UMask": "0x1" + }, + { + "BriefDescription": "L1I instruction fetch misses", + "Counter": 
"0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.MISSES", + "SampleAfterValue": "2000000", + "UMask": "0x2" + }, + { + "BriefDescription": "L1I Instruction fetches", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.READS", + "SampleAfterValue": "2000000", + "UMask": "0x3" + }, { "BriefDescription": "All L2 data requests", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/nehalemep/other.json b/tools/pe= rf/pmu-events/arch/x86/nehalemep/other.json index f6887b234b0e..5fe5ca778e9f 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemep/other.json +++ b/tools/perf/pmu-events/arch/x86/nehalemep/other.json @@ -15,46 +15,6 @@ "SampleAfterValue": "2000000", "UMask": "0x1" }, - { - "BriefDescription": "L1I instruction fetch stall cycles", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.CYCLES_STALLED", - "SampleAfterValue": "2000000", - "UMask": "0x4" - }, - { - "BriefDescription": "L1I instruction fetch hits", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.HITS", - "SampleAfterValue": "2000000", - "UMask": "0x1" - }, - { - "BriefDescription": "L1I instruction fetch misses", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.MISSES", - "SampleAfterValue": "2000000", - "UMask": "0x2" - }, - { - "BriefDescription": "L1I Instruction fetches", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.READS", - "SampleAfterValue": "2000000", - "UMask": "0x3" - }, - { - "BriefDescription": "Large ITLB hit", - "Counter": "0,1,2,3", - "EventCode": "0x82", - "EventName": "LARGE_ITLB.HIT", - "SampleAfterValue": "200000", - "UMask": "0x1" - }, { "BriefDescription": "All loads dispatched", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/nehalemep/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/nehalemep/virtual-memory.json index e88c0802e679..accd263cfbfd 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemep/virtual-memory.json +++ 
b/tools/perf/pmu-events/arch/x86/nehalemep/virtual-memory.json @@ -88,6 +88,14 @@ "SampleAfterValue": "200000", "UMask": "0x20" }, + { + "BriefDescription": "Large ITLB hit", + "Counter": "0,1,2,3", + "EventCode": "0x82", + "EventName": "LARGE_ITLB.HIT", + "SampleAfterValue": "200000", + "UMask": "0x1" + }, { "BriefDescription": "Retired loads that miss the DTLB (Precise Event)", "Counter": "0,1,2,3", -- 2.49.0.395.g12beb8f557-goog
From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:52 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-25-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 24/35] perf vendor events: Update nehalemex events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Type: text/plain; charset="utf-8"

Update event topic moving other topic events to cache and virtual memory.
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/nehalemex/cache.json | 32 +++++++++++++++ .../pmu-events/arch/x86/nehalemex/other.json | 40 ------------------- .../arch/x86/nehalemex/virtual-memory.json | 8 ++++ 3 files changed, 40 insertions(+), 40 deletions(-)
diff --git a/tools/perf/pmu-events/arch/x86/nehalemex/cache.json b/tools/perf/pmu-events/arch/x86/nehalemex/cache.json index 2c0ea6f8c4e0..b6c6b22a3188 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemex/cache.json +++ b/tools/perf/pmu-events/arch/x86/nehalemex/cache.json @@ -239,6 +239,38 @@ "SampleAfterValue": "100000", "UMask": "0x2" }, + { + "BriefDescription": "L1I instruction fetch stall cycles", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.CYCLES_STALLED", + "SampleAfterValue": "2000000", + "UMask": "0x4" + }, + { + "BriefDescription": "L1I instruction fetch hits", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.HITS", + "SampleAfterValue": "2000000", + "UMask": "0x1" + }, + { + "BriefDescription": "L1I instruction fetch misses", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.MISSES", + "SampleAfterValue": "2000000", + "UMask": "0x2" + }, + { + "BriefDescription": "L1I Instruction fetches", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.READS", + "SampleAfterValue": "2000000", + "UMask": "0x3" + }, { "BriefDescription": "All L2 data requests", "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/nehalemex/other.json b/tools/perf/pmu-events/arch/x86/nehalemex/other.json index f6887b234b0e..5fe5ca778e9f 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemex/other.json +++ b/tools/perf/pmu-events/arch/x86/nehalemex/other.json @@ -15,46 +15,6 @@ "SampleAfterValue": "2000000", "UMask": "0x1" }, - { - "BriefDescription": "L1I instruction fetch stall cycles", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.CYCLES_STALLED", - "SampleAfterValue": "2000000", - "UMask": "0x4" - }, - { - "BriefDescription": "L1I instruction fetch hits", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.HITS", - "SampleAfterValue": "2000000", - "UMask": "0x1" - }, - { - "BriefDescription": "L1I instruction fetch misses", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.MISSES", - "SampleAfterValue": "2000000", - "UMask": "0x2" - }, - { - "BriefDescription": "L1I Instruction fetches", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.READS", - "SampleAfterValue": "2000000", - "UMask": "0x3" - }, - { - "BriefDescription": "Large ITLB hit", - "Counter": "0,1,2,3", - "EventCode": "0x82", - "EventName": "LARGE_ITLB.HIT", - "SampleAfterValue": "200000", - "UMask": "0x1" - }, { "BriefDescription": "All loads dispatched", "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/nehalemex/virtual-memory.json b/tools/perf/pmu-events/arch/x86/nehalemex/virtual-memory.json index e88c0802e679..accd263cfbfd 100644 --- a/tools/perf/pmu-events/arch/x86/nehalemex/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/nehalemex/virtual-memory.json @@ -88,6 +88,14 @@ "SampleAfterValue": "200000", "UMask": "0x20" }, + { + "BriefDescription": "Large ITLB hit", + "Counter": "0,1,2,3", + "EventCode": "0x82", + "EventName": "LARGE_ITLB.HIT", + "SampleAfterValue": "200000", + "UMask": "0x1" + }, { "BriefDescription": "Retired loads that miss the DTLB (Precise Event)", "Counter": "0,1,2,3", -- 2.49.0.395.g12beb8f557-goog
From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:53 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-26-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 25/35] perf vendor events: Update rocketlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Type: text/plain; charset="utf-8"

Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups.
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/rocketlake/cache.json | 60 +++ .../arch/x86/rocketlake/memory.json | 160 ++++++++ .../pmu-events/arch/x86/rocketlake/other.json | 220 ---------- .../arch/x86/rocketlake/rkl-metrics.json | 385 +++++++++--------- 4 files changed, 412 insertions(+), 413 deletions(-)
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json index 791fa526d192..0f543325ec1a 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json @@ -445,6 +445,16 @@ "SampleAfterValue": "50021", "UMask": "0x20" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -505,6 +515,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue":
"100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -565,6 +585,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -625,6 +655,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3", @@ -655,6 +695,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that hit a cacheline in the L3 where a snoop
was sent or not.", "Counter": "0,1,2,3", @@ -715,6 +765,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that hit a cacheline in the L3 where a snoop was sent or not.", "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json index abaf3f4f9d63..1455aaac37b1 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json @@ -176,6 +176,16 @@ "SampleAfterValue": "50021", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -186,6 +196,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode":
"0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -196,6 +226,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -206,6 +256,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000400", +
"SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -216,6 +286,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -226,6 +316,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that was not supplied by the L3 cache.", "Counter":
"0,1,2,3", @@ -236,6 +346,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.OTHER.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184008000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -246,6 +376,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.OTHER.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184008000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts streaming stores that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.STREAMING_WR.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000800", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that was not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -256,6 +406,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts streaming stores that DRAM supplied the request.", + "Counter": "0,1,2,3", + "EventCode": "0xB7, 0xBB", + "EventName": "OCR.STREAMING_WR.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", +
"MSRValue": "0x184000800", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss the L3 cache.", "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/other.json b/tools/perf/pmu-events/arch/x86/rocketlake/other.json index a96b2a989d3f..141cd30a30af 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/other.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/other.json @@ -26,186 +26,6 @@ "SampleAfterValue": "200003", "UMask": "0x20" }, - { - "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7,
0xBB", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1
data cache prefetch requests and software prefetches (except PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L1 data cache prefetch requests and software prefetches (except PREFETCHW) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L1D_AND_SWPF.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch data reads (which bring data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10020", - "SampleAfterValue": "100003", - "UMask":
"0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetch RFOs (which bring data to L2) that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that have any type of response.", "Counter": "0,1,2,3", @@ -216,26 +36,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.OTHER.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184008000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts miscellaneous requests, such as I/O and un-cacheable accesses that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.OTHER.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184008000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of response.", "Counter": "0,1,2,3", @@ -245,25 +45,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts streaming stores that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.STREAMING_WR.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000800", -
"SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts streaming stores that DRAM supplied the request.", - "Counter": "0,1,2,3", - "EventCode": "0xB7, 0xBB", - "EventName": "OCR.STREAMING_WR.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000800", - "SampleAfterValue": "100003", - "UMask": "0x1" } ]
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json index cfda8956353e..71737a1a1997 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue.
However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases.
Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,12 +129,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -149,7 +149,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= 
l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -157,7 +157,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + 
tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= 
oads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_stor= e_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tm= a_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing = + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -165,22 +165,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_br= anches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tm= a_itlb_misses + tma_lcp + tma_ms_switches) 
+ tma_fetch_bandwidth * tma_ms /= (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_clears_resteers + tma_= mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_= dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + 
tma_ms_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -188,7 +188,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / 
(tma_dtlb_store + tma_false_sharing + tma_spli= t_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -196,15 +196,15 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -216,17 +216,17 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", "MetricThreshold": "tma_bottleneck_useful_work > 20" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -248,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { @@ -257,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -266,24 +266,24 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric 
estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -291,7 +291,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -299,33 +299,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": 
"tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((32.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (27 * tm= a_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * 
MEM_= LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(29 * tma_info_system_core_frequency * MEM_LOAD_L3_= HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L= 1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. 
Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -335,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-chained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -362,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -372,7 +372,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -382,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -390,26 +390,26 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. 
TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -417,8 +417,8 @@ "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { @@ -437,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -447,7 +447,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -465,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -474,15 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -490,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU 
has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -499,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -508,7 +508,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -517,7 +517,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -526,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -538,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -556,8 +556,8 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -569,28 +569,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", 
"MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -619,7 +619,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -628,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -637,10 +637,11 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + 
tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -701,11 +702,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -718,20 +719,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -773,7 +774,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -791,7 +792,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": 
"Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -799,7 +800,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -807,7 +808,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -815,7 +816,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -823,7 +824,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -831,7 +832,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -886,7 +887,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 11", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1011,7 +1012,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1073,8 +1074,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1101,12 +1102,12 @@ "MetricExpr": "INST_RETIRED.ANY / 
ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1147,14 +1148,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1195,7 +1195,7 @@ "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. 
This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1203,7 +1203,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1211,7 +1211,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." 
}, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1239,7 +1239,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1248,15 +1248,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1266,13 +1265,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1288,14 +1287,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 7.5" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1303,8 +1302,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents 
fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1312,7 +1311,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1321,7 +1320,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1331,7 +1330,7 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1340,7 +1339,7 @@ "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1350,17 +1349,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETI= RED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) = / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1368,18 +1367,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1396,7 +1395,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1404,31 +1403,31 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1437,7 +1436,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1447,7 +1446,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%" }, { @@ -1457,15 +1456,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1474,7 +1473,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1485,11 +1484,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1511,7 +1510,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1526,24 +1525,24 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. 
Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D1@ / tma_info_core_core= _clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1554,7 +1553,7 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1563,7 +1562,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1578,19 +1577,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1634,8 +1633,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= 
CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1643,8 +1642,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1652,7 +1651,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1661,7 +1660,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1670,14 +1669,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1690,7 +1689,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1699,7 +1698,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1709,7 +1708,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1718,8 +1717,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1727,7 +1726,7 @@ "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1736,8 +1735,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. 
This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1746,8 +1745,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1755,8 +1754,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1773,7 +1772,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -1781,31 +1780,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1813,7 +1812,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1822,7 +1821,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. 
These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1831,8 +1830,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.",
         "ScaleUnit": "100%"
     },
     {
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:54 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-27-irogers@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20250322063403.364981-1-irogers@google.com>
Subject: [PATCH v1 26/35] perf vendor events: Update sandybridge metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin,
 Alexandre Torgue, Caleb Biggers, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Thomas Falcon

Update TMA metrics from 4.8 to 5.02. Move INSTS_WRITTEN_TO_IQ.INSTS to the frontend topic.

Signed-off-by: Ian Rogers
---
 .../arch/x86/sandybridge/frontend.json        |  8 +++++
 .../arch/x86/sandybridge/metricgroups.json    |  5 +++
 .../arch/x86/sandybridge/other.json           |  8 -----
 .../arch/x86/sandybridge/snb-metrics.json     | 36 ++++++++++++++-----
 4 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json b/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json
index e95d1005e22f..5c9ab7680762 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/frontend.json
@@ -278,5 +278,13 @@
         "EventName": "IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Valid instructions written to IQ per cycle.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x17",
+        "EventName": "INSTS_WRITTEN_TO_IQ.INSTS",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json b/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
index 7dc7eb0d3dd3..eb8fbd14138a 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/metricgroups.json
@@ -9,6 +9,7 @@
     "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -33,6 +34,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -48,6 +50,7 @@
     "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Power": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -75,6 +78,7 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -99,6 +103,7 @@
     "tma_issueSmSt": "Metrics related by the issue $issueSmSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/other.json b/tools/perf/pmu-events/arch/x86/sandybridge/other.json
index 42692fa24b6c..970839a9c786 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/other.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/other.json
@@ -33,14 +33,6 @@
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     },
-    {
-        "BriefDescription": "Valid instructions written to IQ per cycle.",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x17",
-        "EventName": "INSTS_WRITTEN_TO_IQ.INSTS",
-        "SampleAfterValue": "2000003",
-        "UMask": "0x1"
-    },
     {
         "BriefDescription": "Cycles when L1 and L2 are locked due to UC or split lock.",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
index ff2e515c744a..823d8b7c4224 100644
--- a/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sandybridge/snb-metrics.json
@@ -127,7 +127,7 @@
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
         "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active.
Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
@@ -211,7 +211,7 @@
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
         "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -220,7 +220,7 @@
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
         "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -240,7 +240,7 @@
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
@@ -275,6 +275,12 @@
         "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_info_thread_ipc / 4 > 0.35",
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_lcp"
     },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
     {
         "BriefDescription": "Total number of retired Instructions",
         "MetricExpr": "INST_RETIRED.ANY",
@@ -290,7 +296,7 @@
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -308,14 +314,14 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / duration_time / 1e3",
+        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / tma_info_system_time / 1e3",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
         "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_mem_bandwidth"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE + FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE + 2 * FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE + 4 * (FP_COMP_OPS_EXE.SSE_PACKED_SINGLE + SIMD_FP_256.PACKED_DOUBLE) + 8 * SIMD_FP_256.PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
@@ -340,6 +346,13 @@
         "MetricName": "tma_info_system_kernel_utilization",
         "MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
     },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
+    },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
         "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)",
@@ -352,6 +365,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -448,7 +468,7 @@
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
         "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM).  The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that).
Related metrics: tma_info_system_dram_bw_use",
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:55 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-28-irogers@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20250322063403.364981-1-irogers@google.com>
Subject: [PATCH v1 27/35] perf vendor events: Update sapphirerapids events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin,
 Alexandre Torgue, Caleb Biggers, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Thomas Falcon

Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups.

Signed-off-by: Ian Rogers
---
 .../arch/x86/sapphirerapids/cache.json        | 150 ++++++
 .../arch/x86/sapphirerapids/memory.json       | 170 +++++++
 .../arch/x86/sapphirerapids/other.json        | 378 --------------
 .../arch/x86/sapphirerapids/pipeline.json     |  58 +++
 .../arch/x86/sapphirerapids/spr-metrics.json  | 465 +++++++++---------
 5 files changed, 610 insertions(+), 611 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json
index e35dbb7c2ccd..4363e53169f7 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json
@@ -588,6 +588,16 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x3"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that hit in the L3 or were snooped from another core's caches on the same socket.",
         "Counter": "0,1,2,3",
@@ -628,6 +638,16 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that hit in the L3 or were snooped from another core's caches on the same socket.",
         "Counter": "0,1,2,3",
@@ -668,6 +688,26 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by PMM attached to this socket, whether or not in Sub NUMA Cluster(SNC) Mode.  In SNC Mode counts PMM accesses that are controlled by the close or distant SNC Cluster.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.LOCAL_SOCKET_PMM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x700C00001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by PMM.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.PMM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x703C00001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that were supplied by a cache on a remote socket where a snoop hit a modified line in another core's caches which forwarded the data.",
         "Counter": "0,1,2,3",
@@ -688,6 +728,16 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by PMM attached to another socket.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.REMOTE_PMM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x703000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that hit a modified line in a distant L3 Cache or were snooped from a distant core's L1/L2 caches on this socket when the system is in SNC (sub-NUMA cluster) mode.",
         "Counter": "0,1,2,3",
@@ -708,6 +758,16 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F3FFC0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that hit in the L3 or were snooped from another core's caches on the same socket.",
         "Counter": "0,1,2,3",
@@ -748,6 +808,36 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts data load hardware prefetch requests to the L1 data cache that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.HWPF_L1D.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10400",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts hardware prefetches (which bring data to L2) that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.HWPF_L2.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10070",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts hardware prefetches to the L3 only that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.HWPF_L3.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x12380",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts hardware prefetches to the L3 only that hit in the L3 or were snooped from another core's caches on the same socket.",
         "Counter": "0,1,2,3",
@@ -758,6 +848,36 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts hardware prefetches to the L3 only that were not supplied by the local socket's L1, L2, or L3 caches and the cacheline was homed in a remote socket.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.HWPF_L3.REMOTE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x90002380",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts writebacks of modified cachelines and streaming stores that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10808",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that have any type of response.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F3FFC4477",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that hit in the L3 or were snooped from another core's caches on the same socket.",
         "Counter": "0,1,2,3",
@@ -798,6 +918,26 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by PMM attached to this socket, whether or not in Sub NUMA Cluster(SNC) Mode.
In SNC Mode counts PMM accesses that are controlled by the close or distant SNC Cluster.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_PMM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x700C04477",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were not supplied by the local socket's L1, L2, or L3 caches and were supplied by a remote socket.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.READS_TO_CORE.REMOTE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F33004477",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by a cache on a remote socket where a snoop was sent and data was returned (Modified or Not Modified).",
         "Counter": "0,1,2,3",
@@ -828,6 +968,16 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by PMM attached to another socket.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.READS_TO_CORE.REMOTE_PMM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x703004477",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that hit a modified line in a distant L3 Cache or were snooped from a distant core's L1/L2 caches on this socket when the system is in SNC (sub-NUMA cluster) mode.",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json
index 41d4120d4dae..981e573330cd 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json
@@ -173,6 +173,16 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x2"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_CODE_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x73C000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the local socket's L1, L2, or L3 caches.",
         "Counter": "0,1,2,3",
@@ -183,6 +193,36 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode.  In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x104000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM on a distant memory controller of this socket when the system is in SNC (sub-NUMA cluster) mode.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x708000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A,0x2B",
+        "EventName": "OCR.DEMAND_DATA_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x73C000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that were not supplied by the local socket's L1, L2, or L3 caches.",
         "Counter": "0,1,2,3",
@@ -193,6 +233,46 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode.
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were n= ot supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -203,6 +283,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. 
In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts hardware prefetches to the L3 only tha= t missed the local socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -223,6 +323,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x73C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches.", "Counter": "0,1,2,3", @@ -253,6 +363,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x733004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA 
cluster) mode.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.SNC_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x708004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that missed the local= socket's L1, L2, and L3 caches.", "Counter": "0,1,2,3", @@ -273,6 +433,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json b/too= ls/perf/pmu-events/arch/x86/sapphirerapids/other.json index 05d8f14956ee..df4019ff7883 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/other.json @@ -7,324 +7,6 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, - { - "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xb7", - "EventName": "EXE.AMX_BUSY", - "SampleAfterValue": "2000003", - "UMask": "0x2" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches 
that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM attached to this socket= , unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM= accesses that are controlled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM on a distant memory con= troller of this socket when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_CODE_RD.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. 
In S= NC Mode counts only those DRAM accesses that are controlled by the close SN= C Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to this socket, whether or not in Sub NUMA Cluster(SNC) Mode= . In SNC Mode counts PMM accesses that are controlled by the close or dist= ant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_SOCKET_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700C00001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703C00001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x703000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM on a distant memory controller of this socket when the system is in = SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_DATA_RD.SNC_DRAM", - "MSRIndex": 
"0x1a6,0x1a7", - "MSRValue": "0x708000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that have a= ny type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC0002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mo= de. 
In SNC Mode counts only those DRAM accesses that are controlled by the= close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that were s= upplied by DRAM on a distant memory controller of this socket when the syst= em is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.DEMAND_RFO.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts data load hardware prefetch requests t= o the L1 data cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L1D.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches (which bring data = to L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L2.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10070", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.HWPF_L3.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x12380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts hardware prefetches to the L3 only tha= t were not supplied by the local socket's L1, L2, or L3 caches and the cach= eline was homed in a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": 
"OCR.HWPF_L3.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x90002380", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts writebacks of modified cachelines and = streaming stores that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.MODIFIED_WRITE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10808", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F3FFC4477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x73C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA = Cluster(SNC) Mode. 
In SNC Mode counts only those DRAM accesses that are co= ntrolled by the close SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x104004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to this socket, whether or not in S= ub NUMA Cluster(SNC) Mode. In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x70C004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to this socket, whether or not in Su= b NUMA Cluster(SNC) Mode. 
In SNC Mode counts PMM accesses that are control= led by the close or distant SNC Cluster.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x700C04477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3F33004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x733004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by PMM attached to another socket.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.REMOTE_PMM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": 
"0x703004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM on a distant memory controller of this socke= t when the system is in SNC (sub-NUMA cluster) mode.", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.READS_TO_CORE.SNC_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x708004477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -335,66 +17,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", - "Counter": "0,1,2,3", - "EventCode": "0x2A,0x2B", - "EventName": "OCR.WRITE_ESTIMATE.MEMORY", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFBFF80822", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY", - "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", - "SampleAfterValue": "1000003", - "UMask": "0x7" - }, - { - "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_COUNT", - "Invert": "1", - "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. 
Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", - "SampleAfterValue": "100003", - "UMask": "0x7" - }, - { - "BriefDescription": "Cycles when Reservation Station (RS) is empty= due to a resource in the back-end", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa5", - "EventName": "RS.EMPTY_RESOURCE", - "SampleAfterValue": "1000003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY_COUNT", - "Counter": "0,1,2,3,4,5,6,7", - "CounterMask": "1", - "Deprecated": "1", - "EdgeDetect": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.COUNT", - "Invert": "1", - "SampleAfterValue": "100003", - "UMask": "0x7" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY", - "Counter": "0,1,2,3,4,5,6,7", - "Deprecated": "1", - "EventCode": "0xa5", - "EventName": "RS_EMPTY.CYCLES", - "SampleAfterValue": "1000003", - "UMask": "0x7" - }, { "BriefDescription": "Cycles the uncore cannot take further request= s", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/= tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json index 50cacfbbc7cf..c16b63979c55 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json @@ -367,6 +367,14 @@ "SampleAfterValue": "1000003", "UMask": "0x4" }, + { + "BriefDescription": "Counts the cycles where the AMX (Advance Matr= ix Extension) unit is busy performing an operation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb7", + "EventName": "EXE.AMX_BUSY", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", "Counter": "0,1,2,3,4,5,6,7", @@ -740,6 +748,56 @@ "SampleAfterValue": "100003", "UMask": "0x2" }, + { + "BriefDescription": "Cycles when 
Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= due to a resource in the back-end", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = RS.EMPTY_COUNT", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "Deprecated": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.COUNT", + "Invert": "1", + "SampleAfterValue": "100003", + "UMask": "0x7" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = RS.EMPTY", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xa5", + "EventName": "RS_EMPTY.CYCLES", + "SampleAfterValue": "1000003", + "UMask": "0x7" + }, { "BriefDescription": "TMA slots where no uops were being issued due= to lack of back-end resources.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json= b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json index b59fae4a887d..fc87899a2168 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json @@ -360,7 +360,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -372,7 +372,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", + "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", "ScaleUnit": "100%" }, { @@ -380,12 +380,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & 
(tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -395,7 +395,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_inf= o_thread_slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": 
"tma_backend_bound > 0.2", @@ -411,12 +411,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -431,7 +431,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + 
tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_l= atency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)= ))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -439,7 +439,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + 
tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependen= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_= bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma= _l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb= _full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tm= a_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_= latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_s= tore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += 
tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_mem= ory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store= + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming= _stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -447,22 +447,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_c= ore_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_a= mx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization = / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utili= zation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy= + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_c= ore_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilizat= ion + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization = / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_ope= ration)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fe= tch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers= + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredict= s) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches= )) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_sw= itches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_= mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + 
tma_ms_switches) + tma_fetch_bandwidth * tma_ms / (tma_ds= b + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_b= ranch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_othe= r_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_c= lears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_mis= ses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) += tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_m= icrocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_b= ranch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes = + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_inf= o_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_oper= ation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + t= ma_fetch_bandwidth * tma_ms / (tma_dsb + tma_mite + tma_ms)) + 10 * tma_mic= rocode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_bra= 
nch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + = tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_= thread_clks * tma_ports_utilized_0) / (tma_amx_busy + tma_divider + tma_por= ts_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tm= a_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_mic= rocode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -470,7 +470,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound= + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_sto= re_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": 
"tma_bottleneck_memory_data_tlbs > 20", @@ -478,7 +478,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": 
"tma_bottleneck_memory_synchronization > 10", @@ -486,7 +486,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -498,10 +498,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -510,7 +510,7 @@ { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tm= a_info_thread_slots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", @@ -523,24 +523,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. 
Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup 
time; Larger power savings)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -548,7 +548,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. 
Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%" }, @@ -557,24 +557,24 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", 
"MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -582,7 +582,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -590,32 +590,32 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + 
ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_i= nfo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * tma_info_system_core_fre= quency - 4.4 * tma_info_system_core_frequency) 
* MEM_LOAD_L3_HIT_RETIRED.XS= NP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_= L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMA= ND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD= ))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_= MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_= info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. 
Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -626,24 +626,24 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-chained long-latency arithmetic operations).",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
-        "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD.
Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -652,7 +652,7 @@
         "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.
Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
@@ -661,7 +661,7 @@
         "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
         "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
@@ -671,7 +671,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
         "ScaleUnit": "100%"
     },
     {
@@ -679,34 +679,34 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines.
The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS.
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses.
TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES.
Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
-        "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
+        "MetricExpr": "(170 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_MISS@offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
         "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
         "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing.
False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -727,7 +727,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2.
Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -738,7 +738,7 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
         "ScaleUnit": "100%"
     },
     {
@@ -756,7 +756,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired).
Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
         "ScaleUnit": "100%"
     },
     {
@@ -765,15 +765,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active.",
         "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
         "MetricName": "tma_fp_divider",
-        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
         "ScaleUnit": "100%"
     },
     {
@@ -781,8 +781,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.
Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -790,8 +790,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting.
Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -799,8 +799,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -808,8 +808,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -817,8 +817,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -829,27 +829,27 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part.
Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions.
{([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
         "DefaultMetricgroupName": "TopdownL2",
-        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
         "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]).
Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
     {
@@ -857,8 +857,8 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
         "ScaleUnit": "100%"
     },
     {
@@ -869,28 +869,28 @@
         "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear).
Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate).",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
     },
     {
-        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate).",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -918,7 +918,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite + tma_ms)))",
         "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
@@ -926,7 +926,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -934,10 +934,11 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
+        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
@@ -998,11 +999,11 @@
         "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.PORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -1015,20 +1016,20 @@
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache).
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1065,13 +1066,13 @@ }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown 
Branch detection. See Unknown_Branches node" + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node." }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1089,7 +1090,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1097,7 +1098,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1105,7 +1106,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1113,7 +1114,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1121,7 +1122,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1129,7 +1130,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1137,7 +1138,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1192,7 +1193,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 13", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1329,7 +1330,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1394,21 +1395,21 @@ "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." }, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. 
Se= e R2C_Offcore_BW" + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches" + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1436,8 +1437,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1458,18 +1459,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" @@ -1532,14 +1533,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1550,7 +1550,7 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / 
UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" @@ -1560,11 +1560,11 @@ "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", "MetricGroup": "LockCont;MemOffcore;Server;SoC", "MetricName": "tma_info_system_mem_irq_duplicate_address", - "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + "MetricThreshold": "tma_info_system_mem_irq_duplicate_address > 0.= 1" }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" @@ -1598,7 +1598,7 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "cha_0@event\\=3D0x0@", + "MetricExpr": "uncore_cha_0@event\\=3D0x1@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, @@ -1628,7 +1628,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1637,15 +1637,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1655,13 +1654,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1677,14 +1676,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 9" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1693,7 +1692,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). 
Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%" }, { @@ -1701,8 +1700,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1710,8 +1709,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1719,8 +1718,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1728,7 +1727,7 @@ "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1737,7 +1736,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1746,7 +1745,7 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1755,7 +1754,7 @@ "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1764,17 +1763,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1782,19 +1781,19 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1811,7 +1810,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1819,39 +1818,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1860,7 +1859,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1876,19 +1875,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth 
limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1897,32 +1896,32 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1943,7 +1942,7 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & 
tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1957,17 +1956,17 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks = / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1975,10 +1974,10 @@ }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (UO= PS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). 
Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%" }, @@ -1988,7 +1987,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1996,7 +1995,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -2010,19 +2009,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to 
Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -2031,7 +2030,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%" }, { @@ -2040,7 +2039,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2049,7 +2048,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2058,7 +2057,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -2066,8 +2065,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -2075,8 +2074,8 @@ "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -2084,7 +2083,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -2094,8 +2093,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%" }, { @@ -2104,32 +2103,32 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * = tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD= _L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. 
This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2142,7 +2141,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -2151,8 +2150,8 @@ "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", "ScaleUnit": "100%" }, { @@ -2161,7 +2160,7 @@ "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, @@ -2171,7 +2170,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -2179,8 +2178,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -2188,7 +2187,7 @@ "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -2197,8 +2196,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -2206,8 +2205,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -2215,8 +2214,8 @@ "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2233,7 +2232,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -2241,31 +2240,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory 
paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache 
translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2273,7 +2272,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. 
Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2282,7 +2281,7 @@ "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -2291,8 +2290,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E2E401F37D1 for ; Sat, 22 Mar 2025 06:35:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625340; cv=none; b=Zu5MKqVPRKyaiW7jRXA+I4IXyUiPHWCmhDvgeYRSDmC/+VwY91ftzb7pmRMU3iu9HvC+Y/aODqgEz9TD/HN9id3PtKl9/LslvLBiD847cYRs6HKrrxxF+CBQ36zHKTzPiDy3cckDHmTgcXXCCtsE77NBUoFmtFR23iQ17WKAYvk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625340; c=relaxed/simple; bh=ctFqUDQckKzVSG33T+IsdXdx8dTadzpOv0SmZRQHLw8=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=e1oGI+l+93TgzWMH0Zt7z7Bo/pM4Hv3ic48nELdSTrP9hlyMBYWcnFwvj1R2okNJUvkrPZj+CpAeitPD0PT4+Ho0R9cuTJ0AthhRqV5943O6w4sAbgrWO2F/eEnUqHcqDCOZvn1YTO7FfUv1gGfzixBFhLXSwEUKSEs6XNk9CFw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=Zj2YUO0C; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="Zj2YUO0C" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6f4348c854eso31734657b3.2 for ; Fri, 21 Mar 2025 23:35:35 -0700 (PDT) DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625335; x=1743230135; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=hCeVNN7TIWImHS0j62KJ6VLYzh9l1k/vX/z1AR1Qmr4=; b=Zj2YUO0Caa7/O84uX6md2nfLg81iWelapW8RjGfraW0G7tz3ueyLgY1izcoEaEYK0Y I82XpjvX1U0ELzXFH1NKJ64Z8NQ+d5LnKujVr5a9qwzF3lIrT9LPKqh3LfdFP+MVOCP9 KQaFH1eN2ayVURQb6wMsR5+I4Z3wwtYzQmgsOQI/1k1tITDluq/yaedmddLyHDVJYObn J9D4wsM6RbSCKokzJb6+gfBH6XhG2mZoL9ycbXwprqLJDKs7AqfwLu78OFDPVCt7eVYv hcyE/jTD/4VrICPBzRTdtAVS1QowI/h2obSyb62JRnpNHnTDfodoEj7e8lO2BIP3Kkd9 ZL0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625335; x=1743230135; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=hCeVNN7TIWImHS0j62KJ6VLYzh9l1k/vX/z1AR1Qmr4=; b=RXBhWrxnu82OOqlDzVreZgmP1/TjlKkTIgpqK8fDR9v40bZZ4F3UWCo1PwTJkTt4J0 sRUvc7/IyZfkLofIyEkrdKRXD0EO/U3zeoqFwhyqPn/WIvgrv60kxA/hiFa1sG0Q47Q4 gIaq8vDUIk1LbGSEja9S8kk69aZq+Dn8oi/YBIGESL1hGrz9ADWnBATXfik5XSKNvt6B maopeQuJx8fRNSCoR6wgvbGddH+k2UUo6BINmINXGs3u6Z9Ny0Bsbc+GMdPwZggP1W8L ei491EFMjrjNsSrUpV+eIXN/zOHH6oBl3KuQ29Kl6Pi1DP7/OAViESQe20/Yxl4ejs6t 8J1A== X-Forwarded-Encrypted: i=1; AJvYcCUhXB88yxIXI3lpWCIn2hFOWbCzX/dtYsyYdJuPbM2dV3scmqzHdtwWy4PXz3vvGmBah9NQtlm7t2SrKJw=@vger.kernel.org X-Gm-Message-State: AOJu0YyxCG3jNJflkLsi/634AcP7DGYlbEAy114OL4a1bDb0RTtHgYqe OgooceH1xdt25AvSwXvT8Tb0tzBPWx+1w2XvheISIM8efReOsztIqUJ/uHNLtJ5VQ8crWee56AR x9OEdUA== X-Google-Smtp-Source: AGHT+IFGcHJsP7czHiNgTeXXByD8ILYoJ0lIeefL5I0HwRpMqcypTiBwO8HVqFWXFxtUfQmqaT7j7ZkpoVvZ X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a05:690c:3383:b0:6fb:8461:e7f4 with SMTP id 00721157ae682-700bacc218dmr325847b3.3.1742625334919; Fri, 21 Mar 2025 23:35:34 -0700 (PDT) Date: 
Fri, 21 Mar 2025 23:33:56 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-29-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 28/35] perf vendor events: Update sierraforest events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v1.08 to v1.09. Update event topics. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sierraforest/cache.json | 20 ++ .../arch/x86/sierraforest/memory.json | 20 ++ .../arch/x86/sierraforest/other.json | 48 ---- .../arch/x86/sierraforest/pipeline.json | 8 + .../arch/x86/sierraforest/uncore-cache.json | 32 +++ .../arch/x86/sierraforest/uncore-memory.json | 240 ++++++++++++++++++ 7 files changed, 321 insertions(+), 49 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 0c16c9b840a5..bde2f32423a1 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -29,7 +29,7 @@ GenuineIntel-6-2E,v4,nehalemex,core GenuineIntel-6-A7,v1.04,rocketlake,core GenuineIntel-6-2A,v19,sandybridge,core GenuineIntel-6-8F,v1.25,sapphirerapids,core -GenuineIntel-6-AF,v1.08,sierraforest,core +GenuineIntel-6-AF,v1.09,sierraforest,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core 
GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core GenuineIntel-6-55-[01234],v1.36,skylakex,core diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/cache.json b/tools= /perf/pmu-events/arch/x86/sierraforest/cache.json index 072df00aff92..21671c65d6dd 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/cache.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/cache.json @@ -466,6 +466,16 @@ "SampleAfterValue": "1000003", "UMask": "0x6" }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, and modified data was= forwarded.", "Counter": "0,1,2,3,4,5,6,7", @@ -486,6 +496,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache where a snoop was sent, the snoop hit, and modified data was= forwarded.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/memory.json b/tool= s/perf/pmu-events/arch/x86/sierraforest/memory.json index 22d23077618e..3e2d0b565cfa 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/memory.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/memory.json @@ -82,6 +82,26 @@ "SampleAfterValue": "100003", 
"UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/other.json b/tools= /perf/pmu-events/arch/x86/sierraforest/other.json index 4c77dac8ec78..daa16030d493 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/other.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/other.json @@ -8,46 +8,6 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, - { - "BriefDescription": "Counts demand data reads that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", - "Counter": "0,1,2,3,4,5,6,7", - 
"EventCode": "0xB7", - "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x730000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3,4,5,6,7", @@ -57,13 +17,5 @@ "MSRValue": "0x10800", "SampleAfterValue": "100003", "UMask": "0x1" - }, - { - "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x75", - "EventName": "SERIALIZATION.C01_MS_SCB", - "SampleAfterValue": "200003", - "UMask": "0x4" } ] diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json b/to= ols/perf/pmu-events/arch/x86/sierraforest/pipeline.json index df2c7bb474a0..a934b64f66d0 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json @@ -300,6 +300,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear. 
[This event is alias to TOPDOWN_BAD_SPECULATION= .ALL_P]", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json = b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json index a779a1a73ea5..7182ca00ef8d 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json @@ -873,6 +873,38 @@ "UMask": "0x1", "Unit": "CHA" }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking exclusive lines in the cores? cache.? Snoop filter capacity = evictions occur when the snoop filter is full and evicts an existing entry = to track a new entry.? Does not count clean evictions such as when a core?s= cache replaces a tracked cacheline with a new cacheline.", + "Counter": "0,1,2,3", + "EventCode": "0x3d", + "EventName": "UNC_CHA_SF_EVICTION.E_STATE", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Snoop Filter Capacity Evictions : E state", + "UMask": "0x2", + "Unit": "CHA" + }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking modified lines in the cores? cache.? Snoop filter capacity e= victions occur when the snoop filter is full and evicts an existing entry t= o track a new entry.? Does not count clean evictions such as when a core?s = cache replaces a tracked cacheline with a new cacheline.", + "Counter": "0,1,2,3", + "EventCode": "0x3d", + "EventName": "UNC_CHA_SF_EVICTION.M_STATE", + "PerPkg": "1", + "PublicDescription": "Snoop Filter Capacity Evictions : M state", + "UMask": "0x1", + "Unit": "CHA" + }, + { + "BriefDescription": "Counts snoop filter capacity evictions for en= tries tracking shared lines in the cores? cache.? Snoop filter capacity evi= ctions occur when the snoop filter is full and evicts an existing entry to = track a new entry.? 
Does not count clean evictions such as when a core?s cache replaces a tracked cacheline with a new cacheline.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3d",
+        "EventName": "UNC_CHA_SF_EVICTION.S_STATE",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "Snoop Filter Capacity Evictions : S state",
+        "UMask": "0x4",
+        "Unit": "CHA"
+    },
     {
         "BriefDescription": "All TOR Inserts",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json
index ae9c62b32e92..c7e9dbe02eb0 100644
--- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json
+++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json
@@ -188,6 +188,94 @@
         "PublicDescription": "DRAM Clockticks",
         "Unit": "IMC"
     },
+    {
+        "BriefDescription": "# of cycles MR4 temp readings forced 2x refresh",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA7",
+        "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 temp readings forced 2x refresh",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA7",
+        "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH0_DIMM1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 temp readings forced 2x refresh",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA7",
+        "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x4",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 temp readings forced 2x refresh",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA7",
+        "EventName": "UNC_M_MR4_2XREF_CYCLES.SCH1_DIMM1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x8",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4
MRRs was triggered/running",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA6",
+        "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 MRRs was triggered/running",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA6",
+        "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH0_DIMM1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 MRRs was triggered/running",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA6",
+        "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x4",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles MR4 MRRs was triggered/running",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA6",
+        "EventName": "UNC_M_PDC_MR4ACTIVE_CYCLES.SCH1_DIMM1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x8",
+        "Unit": "IMC"
+    },
     {
         "BriefDescription": "# of cycles a given rank is in Power Down Mode",
         "Counter": "0,1,2,3",
@@ -286,6 +374,70 @@
         "PublicDescription": "-",
         "Unit": "IMC"
     },
+    {
+        "BriefDescription": "# of cycles Throttling at Critical level on specified DIMM and throttle level is zero.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x89",
+        "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Critical level on specified DIMM and throttle level is zero.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x89",
+        "EventName": "UNC_M_POWER_CRITICAL_THROTTLE_CYCLES.SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x46",
+        "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x46",
+        "EventName": "UNC_M_POWER_THROTTLE_CYCLES.BW_SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "MR4 temp reading is throttling",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x46",
+        "EventName": "UNC_M_POWER_THROTTLE_CYCLES.MR4BLKEN",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x8",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "RAPL is throttling",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x46",
+        "EventName": "UNC_M_POWER_THROTTLE_CYCLES.RAPLBLK",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x4",
+        "Unit": "IMC"
+    },
     {
         "BriefDescription": "DRAM Precharge commands. : Counts the number of DRAM Precharge commands sent on this channel.",
         "Counter": "0,1,2,3",
@@ -480,6 +632,94 @@
         "UMask": "0x1",
         "Unit": "IMC"
     },
+    {
+        "BriefDescription": "# of cycles Throttling at Critical level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8e",
+        "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Critical level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8e",
+        "EventName": "UNC_M_THROTTLE_CRIT_CYCLES.SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at High level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8d",
+        "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at High level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8d",
+        "EventName": "UNC_M_THROTTLE_HIGH_CYCLES.SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Normal level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8b",
+        "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Normal level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8b",
+        "EventName": "UNC_M_THROTTLE_LOW_CYCLES.SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Mid level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8c",
+        "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT0",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x1",
+        "Unit": "IMC"
+    },
+    {
+        "BriefDescription": "# of cycles Throttling at Mid level on specified DIMM",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x8c",
+        "EventName": "UNC_M_THROTTLE_MID_CYCLES.SLOT1",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "PublicDescription": "-",
+        "UMask": "0x2",
+        "Unit": "IMC"
+    },
     {
         "BriefDescription": "Write Pending Queue Allocations",
         "Counter": "0,1,2,3",
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BBB691CBA02 for ; Sat, 22 Mar 2025 06:35:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none
smtp.client-ip=209.85.128.201
Date: Fri, 21 Mar 2025 23:33:57 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-30-irogers@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20250322063403.364981-1-irogers@google.com>
X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog
Subject: [PATCH v1 29/35] perf vendor events: Update skylake
metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, "Andreas Färber", Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification.

Signed-off-by: Ian Rogers
---
 .../arch/x86/skylake/skl-metrics.json | 367 +++++++++---------
 1 file changed, 183 insertions(+), 184 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json b/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json
index 2a76dd01fb52..2d3a037e88b5 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json
@@ -74,12 +74,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g.
to L1_Bound)",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -91,7 +91,7 @@
         "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline.
For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -102,7 +102,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound.
Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
         "ScaleUnit": "100%"
     },
     {
@@ -112,12 +112,12 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category.
Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
         "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
         "MetricName": "tma_bottleneck_big_code",
         "MetricThreshold": "tma_bottleneck_big_code > 20"
@@ -132,7 +132,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
         "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
         "MetricName": "tma_bottleneck_cache_memory_bandwidth",
         "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
@@ -140,7 +140,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))",
         "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
         "MetricName": "tma_bottleneck_cache_memory_latency",
         "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
@@ -148,22 +148,22 @@
     },
     {
         "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
         "MetricGroup": "BvCB;Cor;tma_issueComp",
         "MetricName": "tma_bottleneck_compute_bound_est",
         "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy.
Related metrics: "
     },
     {
         "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
-        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_bottleneck_big_code",
         "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
         "MetricName": "tma_bottleneck_instruction_fetch_bw",
         "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
     },
     {
         "BriefDescription": "Total pipeline cost of irregular execution (e.g",
-        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
         "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
         "MetricName": "tma_bottleneck_irregular_overhead",
         "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
@@ -171,7 +171,7 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))",
         "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
         "MetricName": "tma_bottleneck_memory_data_tlbs",
         "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
@@ -179,15 +179,15 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and
coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_o= ther_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 - tma_o= ther_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -199,10 +199,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -224,8 +224,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -233,8 +233,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -242,7 +242,7 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, @@ -251,7 +251,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -259,33 +259,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 
0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": 
"100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((22 * tma_info_system_core_frequency - 3.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (20 * tma_= info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LO= AD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(18.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_HIT_RETIRED.XSNP_HITM + 16.5 * tma_info_system_core_frequency * MEM_LOAD_= L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED= .L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -296,25 +296,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(20 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "16.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. 
Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -323,7 +323,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -333,7 +333,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -343,7 +343,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -351,27 +351,27 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -380,18 +380,18 @@ "MetricExpr": "22 * tma_info_system_core_frequency * OFFCORE_RESPO= NSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related m= etrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_= data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tm= a_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -401,7 +401,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -411,7 +411,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -431,7 +431,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -440,7 +440,7 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { @@ -448,8 +448,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -458,8 +458,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -467,8 +467,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -476,8 +476,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -487,35 +487,35 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -526,11 +526,11 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -555,7 +555,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -564,7 +564,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + 
tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -572,10 +572,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -604,7 +605,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -632,11 +633,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -649,20 +650,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -698,7 +699,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -717,7 +718,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -725,7 +726,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -733,7 +734,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -741,7 +742,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -749,7 +750,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -799,7 +800,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -974,8 +975,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -996,12 +997,12 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1043,14 +1044,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1061,7 +1061,7 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUP= ANCY.DATA_READ@cmask\\=3D0x1@", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUP= ANCY.DATA_READ@cmask\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" @@ -1112,7 +1112,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1121,15 +1121,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1155,15 +1154,15 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1171,7 +1170,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1180,17 +1179,17 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1199,7 +1198,7 @@ "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1208,17 +1207,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(10 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "6.5 * tma_info_system_core_frequency * (MEM_LOAD_RE= TIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)= ) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1226,18 +1225,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1255,7 +1254,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1263,31 +1262,31 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1295,7 +1294,7 @@ "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (9 = * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAND= ING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1307,15 +1306,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. 
For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely 
hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1324,7 +1323,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1336,11 +1335,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1362,7 +1361,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": 
"tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1376,12 +1375,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1389,7 +1388,7 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1399,7 +1398,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1407,8 +1406,8 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP_PS", "ScaleUnit": "100%" }, { @@ -1421,19 +1420,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1442,7 +1441,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es 
CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1451,7 +1450,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1487,7 +1486,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. 
Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1496,7 +1495,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1513,8 +1512,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1522,8 +1521,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1531,7 +1530,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1540,16 +1539,16 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1567,7 +1566,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. 
Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1578,7 +1577,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1586,8 +1585,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1595,7 +1594,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1604,8 +1603,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. 
Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1613,8 +1612,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1623,8 +1622,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / M= EM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1640,7 +1639,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -1648,31 +1647,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1680,7 +1679,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1689,8 +1688,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. 
It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9CCEB1F4176 for ; Sat, 22 Mar 2025 06:35:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625356; cv=none; b=pGFJzc6GbcUNDgj01LGQr2iiJwkw+zEsWJSZ0tLsp+0zf2SdCj+SejNKg3C1LBtEKhiYwOwSZFfQlR1jGCfuc7Q7LzZk5ld7D/XRZ8A8kDF1NfJ6hngqJC5E+BvfWF9Gar/vN25WdQFa/Wgcnxhfz8KyXeUv2dqhnpstFNpLXME= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625356; c=relaxed/simple; bh=Smd1jKiEWEGZJ6nNESE82L+COZ29iPn3G/A1ZhFwfeM=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=Y8Y0mfgDN9JbpjD/3UpZtPb1M6bCkImhYNtgidPAj++02hNvViIch+20jrijWQnRUcXTE9/yC9K71qHtfF/2+FJ62Pd2YErsoN7JIvk3aCwHYV/Uds6nHwe00eUCvHxa2uvNZIbEktJ8GZGpLlvsBS/5bePnTxdBj8rQ7xeBMp4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=OLyCz4aR; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com 
header.b="OLyCz4aR" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6fec1d75f7aso34037007b3.0 for ; Fri, 21 Mar 2025 23:35:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1742625340; x=1743230140; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=kF38Hh4HUGFV2sUBf6Ek+tmHe3WJPjIAvKIShcoR8VY=; b=OLyCz4aR7wOzReMVQFxlOKmUqfn592tS8Z+re1OC/ygchm3g4DY2UufIkoBYED3qNr b7mnMoqikL9CIUydvPxsz1cvAnHVdrm5i0yLR41RPDbw9h1bEDn+4zi9Z1rgXhjISlTt vKWDUBBkP+MrJKQ8DGssMa96H/breldgc5nfFXq2w1FuNnf4ANSN+j6Foa1x9fDSk2Lc 5QydCPXbA8SLy9Kgn11F9zUYGnKQAdxKwr+6qAMYwUtibtvZinUCVqS4rmE/NszU4oEZ KymaTCUiUR6XYngSJ+WoJvi4F8JKIwaX3IlcoIb327kXldobHBO4/IlOJadVQI9RMVf4 GsJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625340; x=1743230140; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=kF38Hh4HUGFV2sUBf6Ek+tmHe3WJPjIAvKIShcoR8VY=; b=jpdz5Qtsd0NA1LKFrEXas8yJSoBbDwiMnzK3xT9mFmNHWYVvOYmwvyC+q51mWw6YcU FugD/0wLMI6YxyDJ73z19MVWyhpU6eUSPm9CvVGtsW2QH5LhTxpJPmj5hUfljVe0uVpo AFLnqo2Z2x3k35clQKIeoht+08vjWMV+MvdSY1f+d0liOXOrNenydIFNh1I6d53pBFdB /PqgYZOYxSXzZsJnPDIQrKyH/fmPbrq+dU7jakUMUegF1JfgDfhXV7TwmssSqmg9VWpm mXMyVhWRinGNTyGcwi+FdvivpVsJBQNfyoZ3JsM/gTQr3Ei0NRrMFydJOBQxsGm4YZNy 9muw== X-Forwarded-Encrypted: i=1; AJvYcCUtkcMc/JCZ8H6Huaqfog6LmHP1o2M1p95hLqByb107E2lApLpryzIWxQPKyT4V76je25CPrESHJSzn/qY=@vger.kernel.org X-Gm-Message-State: AOJu0YwbLan+vWP46xaDOvJHGfZkyBZlFaL8PMmHk7rr21ttmpiUMvhf geiyes4FH8V+ynteWKnS2X9AdwmbGt2pl7717XUxW3sMRJkhBO7q9zI4z+IaW1ZoAQcpesov5Y+ U3BdJ9A== X-Google-Smtp-Source: AGHT+IGXNIHNlMN9d3L/LY8E5zcdE3PM+UArkIp+ee50Gmf4/cuafL5rU8h7NqtS7PcBOKCcVcc/1JQeQw4r X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) 
(user=irogers job=sendgmr) by 2002:a05:690c:5a8f:b0:6fe:afd0:2083 with SMTP id 00721157ae682-700bacc4c78mr78377b3.3.1742625340082; Fri, 21 Mar 2025 23:35:40 -0700 (PDT) Date: Fri, 21 Mar 2025 23:33:58 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-31-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 30/35] perf vendor events: Update skylakex events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topics, metrics to be generated from the TMA spreadsheet and other small clean ups. 
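
Most of the MetricThreshold churn in this series only adds nesting parentheses around the parent conditions. The sketch below is an illustration, not part of the generated JSON. It assumes that '&' in a threshold acts as a logical AND over comparison results and that comparisons bind tighter than '&'. Under those assumptions the flat and nested forms of the tma_store_bound threshold agree. The function names are hypothetical, and Python's `and` stands in for '&' because Python's own '&' has different precedence.

```python
from itertools import product


def threshold_flat(store_bound, memory_bound, backend_bound):
    # Old form: tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2
    return store_bound > 0.2 and memory_bound > 0.2 and backend_bound > 0.2


def threshold_nested(store_bound, memory_bound, backend_bound):
    # New form: tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)
    return store_bound > 0.2 and (memory_bound > 0.2 and backend_bound > 0.2)


# Values below, at and above the 0.2 cut-off used by all three conditions.
samples = [0.0, 0.2, 0.25, 1.0]

# Collect every combination where the two forms would disagree.
mismatches = [
    values
    for values in product(samples, repeat=3)
    if threshold_flat(*values) != threshold_nested(*values)
]

print(len(mismatches))  # prints 0: both forms flag the same inputs
```

The same reasoning applies to the deeper nestings (for example the tma_store_stlb_miss_1g threshold), since each added pair of parentheses only groups a chain of AND-ed comparisons.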
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/skylakex/cache.json | 74 ++++ .../pmu-events/arch/x86/skylakex/other.json | 74 ---- .../arch/x86/skylakex/skx-metrics.json | 385 +++++++++--------- 3 files changed, 266 insertions(+), 267 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/skylakex/cache.json b/tools/per= f/pmu-events/arch/x86/skylakex/cache.json index 2ce070629c52..7aeeb5725630 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/cache.json @@ -1,4 +1,78 @@ [ + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", + "SampleAfterValue": "2000003", + "UMask": "0x20" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDM", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDM", + "SampleAfterValue": "2000003", + "UMask": "0x10" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", + "SampleAfterValue": "2000003", + "UMask": "0x2" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITI", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITI", + "SampleAfterValue": "2000003", + "UMask": "0x1" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", + "SampleAfterValue": "2000003", + "UMask": "0x40" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDM", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDM", + "SampleAfterValue": "2000003", + "UMask": "0x8" + }, + { + "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", + "Counter": "0,1,2,3", + "EventCode": "0xEF", + "EventName": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", + "SampleAfterValue": 
"2000003", + "UMask": "0x4" + }, + { + "BriefDescription": "Counts number of cache lines that are dropped= and not written back to L3 as they are deemed to be less likely to be reus= ed shortly", + "Counter": "0,1,2,3", + "EventCode": "0xFE", + "EventName": "IDI_MISC.WB_DOWNGRADE", + "PublicDescription": "Counts number of cache lines that are droppe= d and not written back to L3 as they are deemed to be less likely to be reu= sed shortly.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, + { + "BriefDescription": "Counts number of cache lines that are allocat= ed and written back to L3 with the intention that they are more likely to b= e reused shortly", + "Counter": "0,1,2,3", + "EventCode": "0xFE", + "EventName": "IDI_MISC.WB_UPGRADE", + "PublicDescription": "Counts number of cache lines that are alloca= ted and written back to L3 with the intention that they are more likely to = be reused shortly.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, { "BriefDescription": "L1D data line replacements", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/other.json b/tools/per= f/pmu-events/arch/x86/skylakex/other.json index 44c820518e12..adf7b6bb5838 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/other.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/other.json @@ -35,62 +35,6 @@ "SampleAfterValue": "200003", "UMask": "0x40" }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDFE", - "SampleAfterValue": "2000003", - "UMask": "0x20" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IFWDM", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IFWDM", - "SampleAfterValue": "2000003", - "UMask": "0x10" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITFSE", - "SampleAfterValue": 
"2000003", - "UMask": "0x2" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_IHITI", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_IHITI", - "SampleAfterValue": "2000003", - "UMask": "0x1" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDFE", - "SampleAfterValue": "2000003", - "UMask": "0x40" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SFWDM", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SFWDM", - "SampleAfterValue": "2000003", - "UMask": "0x8" - }, - { - "BriefDescription": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", - "Counter": "0,1,2,3", - "EventCode": "0xEF", - "EventName": "CORE_SNOOP_RESPONSE.RSP_SHITFSE", - "SampleAfterValue": "2000003", - "UMask": "0x4" - }, { "BriefDescription": "Number of hardware interrupts received by the= processor.", "Counter": "0,1,2,3", @@ -100,24 +44,6 @@ "SampleAfterValue": "203", "UMask": "0x1" }, - { - "BriefDescription": "Counts number of cache lines that are dropped= and not written back to L3 as they are deemed to be less likely to be reus= ed shortly", - "Counter": "0,1,2,3", - "EventCode": "0xFE", - "EventName": "IDI_MISC.WB_DOWNGRADE", - "PublicDescription": "Counts number of cache lines that are droppe= d and not written back to L3 as they are deemed to be less likely to be reu= sed shortly.", - "SampleAfterValue": "100003", - "UMask": "0x4" - }, - { - "BriefDescription": "Counts number of cache lines that are allocat= ed and written back to L3 with the intention that they are more likely to b= e reused shortly", - "Counter": "0,1,2,3", - "EventCode": "0xFE", - "EventName": "IDI_MISC.WB_UPGRADE", - "PublicDescription": "Counts number of cache lines that are alloca= ted and written back to L3 with the intention that they are more likely to = be reused shortly.", - "SampleAfterValue": "100003", - "UMask": "0x2" - }, { 
"BriefDescription": "MEMORY_DISAMBIGUATION.HISTORY_RESET", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/too= ls/perf/pmu-events/arch/x86/skylakex/skx-metrics.json index 2fe630cd4927..7cc7b076c3e2 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json @@ -295,12 +295,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -312,7 +312,7 @@ "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info= _thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, @@ -323,7 +323,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", "ScaleUnit": "100%" }, { @@ -333,12 +333,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. 
For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -353,7 +353,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + 
tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -361,7 +361,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= 
latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_= store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores += tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + 
tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= oads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency)) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -369,22 +369,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + 
tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_b= ranches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tm= a_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + 
tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers = + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_i= tlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * 
(tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -392,7 +392,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + 
tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_spli= t_stores + tma_store_latency)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -400,7 +400,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + 
tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", @@ -408,7 +408,7 @@ }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -420,10 +420,10 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", @@ -445,8 +445,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -454,8 +454,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -463,7 +463,7 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, @@ -472,7 +472,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -480,33 +480,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 
0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": 
"100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCOR= E_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND= _DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SN= OOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_L= OAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER= _CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_R= ESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 44 * tma_info_system_= core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED= .FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma= _false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -517,25 +517,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM= _OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFF= CORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_R= ETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RES= PONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA= _RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_H= IT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / = 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. 
Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -544,7 +544,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -554,7 +554,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -564,7 +564,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -572,27 +572,27 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -601,18 +601,18 @@ "MetricExpr": "(110 * tma_info_system_core_frequency * (OFFCORE_RE= SPONSE.DEMAND_RFO.L3_MISS.REMOTE_HITM + OFFCORE_RESPONSE.PF_L2_RFO.L3_MISS.= REMOTE_HITM) + 47.5 * tma_info_system_core_frequency * (OFFCORE_RESPONSE.DE= MAND_RFO.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.PF_L2_RFO.L3_HIT.HITM_OT= HER_CORE)) / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. 
Rela= ted metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses,= tma_data_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tm= a_data_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -622,7 +622,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -632,7 +632,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -652,7 +652,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -661,7 +661,7 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { @@ -669,17 +669,17 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xFC@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -688,7 +688,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. 
May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -697,7 +697,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -706,7 +706,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -717,35 +717,35 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -756,11 +756,11 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -785,7 +785,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -794,7 +794,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + 
tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -802,10 +802,11 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -834,7 +835,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -859,14 +860,14 @@ }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@) / (2 * tma_info_core_core_clks= )", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) / (2 * tma_info_core_core_clks= )", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." 
}, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -879,20 +880,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -928,7 +929,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": 
"Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -943,11 +944,11 @@ { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -955,7 +956,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -963,7 +964,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -971,7 +972,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -979,7 +980,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -987,7 +988,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1037,7 +1038,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 9", "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1224,8 +1225,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1246,12 +1247,12 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1307,14 +1308,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1332,7 +1332,7 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" @@ -1362,7 +1362,7 @@ "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_= core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_cor= e_clks)", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1370,7 +1370,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1378,7 +1378,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1412,7 +1412,7 @@ "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1421,15 +1421,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1455,15 +1454,15 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 6" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1471,7 +1470,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. 
However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1480,17 +1479,17 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1499,7 +1498,7 @@ "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1508,17 +1507,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1526,18 +1525,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1555,7 +1554,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1563,39 +1562,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "(80 * tma_info_system_core_frequency - 20.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", + "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, @@ -1604,7 +1603,7 @@ "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1621,10 +1620,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1633,7 +1632,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1645,11 +1644,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1671,7 +1670,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1685,12 +1684,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. 
R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1698,7 +1697,7 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1708,7 +1707,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { @@ -1716,8 +1715,8 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP_PS", "ScaleUnit": "100%" }, { @@ -1730,19 +1729,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1751,7 +1750,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es 
CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1760,7 +1759,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_por= t_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1796,7 +1795,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. 
Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1805,7 +1804,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1822,8 +1821,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1831,8 +1830,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1840,7 +1839,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1849,35 +1848,35 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 = * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) *= MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_= LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", 
"MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. 
Related metri= cs: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data= _sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tm= a_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 += MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_= clks", + "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. 
Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { @@ -1895,7 +1894,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1906,7 +1905,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1914,8 +1913,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1923,7 +1922,7 @@ "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1932,8 +1931,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1941,8 +1940,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1951,8 +1950,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1968,7 +1967,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -1976,31 +1975,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -2008,7 +2007,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -2017,8 +2016,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. 
It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:33:59 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-32-irogers@google.com>
References: <20250322063403.364981-1-irogers@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: [PATCH v1 31/35] perf vendor events: Update snowridgex events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin, Alexandre Torgue, Caleb Biggers, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Thomas Falcon
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update event topic moving other topic events to cache and memory.
Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/snowridgex/cache.json | 192 +++++++++ .../arch/x86/snowridgex/memory.json | 202 +++++++++ .../pmu-events/arch/x86/snowridgex/other.json | 394 ------------------ 3 files changed, 394 insertions(+), 394 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/snowridgex/cache.json b/tools/p= erf/pmu-events/arch/x86/snowridgex/cache.json index 7882dca9d5e1..1bb42acf1d48 100644 --- a/tools/perf/pmu-events/arch/x86/snowridgex/cache.json +++ b/tools/perf/pmu-events/arch/x86/snowridgex/cache.json @@ -357,6 +357,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -367,6 +377,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have an outstanding request. Returns the number of cycles unt= il the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.COREWB_M.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8003000000000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -427,6 +457,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -487,6 +527,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve an outstanding request. Returns the number of cycles until the response = is received (i.e. XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.L3_HIT", "Counter": "0,1,2,3", @@ -553,6 +614,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y the L3 cache.", "Counter": "0,1,2,3", @@ -613,6 +695,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have an outstan= ding request. Returns the number of cycles until the response is received (= i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -623,6 +715,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that have any type of resp= onse.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10400", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that were supplied by the = L3 cache where a snoop was sent, the snoop hit, and modified data was forwa= rded.", "Counter": "0,1,2,3", @@ -633,6 +735,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -693,6 +805,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have an outstanding request. Returns th= e number of cycles until the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -753,6 +885,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -813,6 +955,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have an outstanding request. Returns the numb= er of cycles until the response is received (i.e. 
XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.L1WB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -823,6 +985,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.L2WB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x2000000010000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that were supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -843,6 +1015,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -903,6 +1085,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": 
"Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e an outstanding request. Returns the number of cycles until the response i= s received (i.e. XQ to XQ latency).", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.OUTSTANDING", + "MSRIndex": "0x1a6", + "MSRValue": "0x8000000000000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that were supplied by= the L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/snowridgex/memory.json b/tools/= perf/pmu-events/arch/x86/snowridgex/memory.json index 34306ec24e9b..260a488540bb 100644 --- a/tools/perf/pmu-events/arch/x86/snowridgex/memory.json +++ b/tools/perf/pmu-events/arch/x86/snowridgex/memory.json @@ -25,6 +25,16 @@ "SampleAfterValue": "200003", "UMask": "0x4" }, + { + "BriefDescription": "Counts all code reads that were supplied by D= RAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.ALL_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000044", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all code reads that were not supplied = by the L3 cache.", "Counter": "0,1,2,3", @@ -45,6 +55,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all code reads that were supplied by D= RAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.ALL_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000044", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -65,6 +85,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + 
"Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -85,6 +115,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000004", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -105,6 +155,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.L3_MISS", "Counter": "0,1,2,3", @@ -127,6 +198,27 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", + "Counter": "0,1,2,3", + "Deprecated": "1", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3", @@ -147,6 +239,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -167,6 +269,16 @@ "SampleAfterValue": 
"100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -187,6 +299,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_CODE_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000040", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -207,6 +339,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000010", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", + "Counter": 
"0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -227,6 +379,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000020", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that were not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -317,6 +479,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e not supplied by the L3 cache.", "Counter": "0,1,2,3", @@ -337,6 +509,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { 
"BriefDescription": "Counts streaming stores that were not supplie= d by the L3 cache.", "Counter": "0,1,2,3", @@ -357,6 +539,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.UC_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100184000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts uncached memory reads that were not su= pplied by the L3 cache.", "Counter": "0,1,2,3", @@ -377,6 +569,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0XB7", + "EventName": "OCR.UC_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x100184000000", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts uncached memory writes that were not s= upplied by the L3 cache.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/snowridgex/other.json b/tools/p= erf/pmu-events/arch/x86/snowridgex/other.json index 57613207f7ad..35cdbfa617e7 100644 --- a/tools/perf/pmu-events/arch/x86/snowridgex/other.json +++ b/tools/perf/pmu-events/arch/x86/snowridgex/other.json @@ -116,26 +116,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all code reads that were supplied by D= RAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.ALL_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000044", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all code reads that were supplied by D= RAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.ALL_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000044", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": 
"Counts all code reads that have an outstandin= g request. Returns the number of cycles until the response is received (i.e= . XQ to XQ latency).", "Counter": "0,1,2,3", @@ -146,180 +126,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.COREWB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writebacks from L1 cache and = L2 cache that have an outstanding request. Returns the number of cycles unt= il the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.COREWB_M.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8003000000000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000004", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - 
"BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that we= re supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts cacheable demand data reads, L1 data c= ache hardware prefetches and software prefetches (except PREFETCHW) that ha= ve an outstanding request. Returns the number of cycles until the response = is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.ANY_RESPONSE", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.DRAM", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.LOCAL_DRAM", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event = OCR.DEMAND_DATA_AND_L1PF_RD.OUTSTANDING", - "Counter": "0,1,2,3", - "Deprecated": "1", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_DATA_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000001", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that were supplied b= y DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have an outstan= ding request. Returns the number of cycles until the response is received (= i.e. 
XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.DEMAND_RFO.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000002", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", "Counter": "0,1,2,3", @@ -330,146 +136,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts L1 data cache hardware prefetches and = software prefetches (except PREFETCHW and PFRFO) that have any type of resp= onse.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L1D_AND_SWPF.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10400", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch code reads = (written to the L2 cache only) that have an outstanding request. 
Returns th= e number of cycles until the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_CODE_RD.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000040", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch data reads = (written to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_DATA_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000010", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - 
"BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that were supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts L2 cache hardware prefetch RFOs (writt= en to the L2 cache only) that have an outstanding request. Returns the numb= er of cycles until the response is received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.HWPF_L2_RFO.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000020", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writebacks from L1 cache that= miss the L2 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.L1WB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x1000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts modified writeBacks from L2 cache that= miss the L3 cache that have any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.L2WB_M.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x2000000010000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts miscellaneous requests, such as I/O ac= cesses, that have any type of response.", "Counter": "0,1,2,3", @@ -500,46 +166,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e any type of response.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.ANY_RESPONSE", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x10477", - "SampleAfterValue": "100003", - "UMask": "0x1" - 
}, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that wer= e supplied by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x184000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts all data read, code read and RFO reque= sts including demands and prefetches to the core caches (L1 or L2) that hav= e an outstanding request. Returns the number of cycles until the response i= s received (i.e. XQ to XQ latency).", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.READS_TO_CORE.OUTSTANDING", - "MSRIndex": "0x1a6", - "MSRValue": "0x8000000000000477", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -560,26 +186,6 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, - { - "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.UC_RD.DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100184000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, - { - "BriefDescription": "Counts uncached memory reads that were suppli= ed by DRAM.", - "Counter": "0,1,2,3", - "EventCode": "0XB7", - "EventName": "OCR.UC_RD.LOCAL_DRAM", - "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x100184000000", - "SampleAfterValue": "100003", - "UMask": "0x1" - }, { "BriefDescription": 
"Counts uncached memory reads that have an out= standing request. Returns the number of cycles until the response is receiv= ed (i.e. XQ to XQ latency).", "Counter": "0,1,2,3", --=20 2.49.0.395.g12beb8f557-goog From nobody Thu Dec 18 13:40:49 2025 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8EEF01F4628 for ; Sat, 22 Mar 2025 06:35:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625360; cv=none; b=JzRlrbrome7g+C+79dopWZk47KJIGLhDLYyWIT9eqs+SM/vin0HH6GaY/CiD5fYwethhalVy9J/MA52oyOqBK9CUHaLkUrNwtIaj5dYAkWGDUcVDV/KkvbpmzUTBZNMleqi65Rx2b5BXj5kLpvjeidE12UnUyH5KMf9JoK0lPBE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742625360; c=relaxed/simple; bh=NI07KYNDZ+eoog7me4jDr2tImkQxoQ/XfiHMbFV8adg=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=SxOm1Wo+9qMtd0+dtgJVX7UkxFeVW+1QCkPo2Jmr1G3+nsY9evfuZFUiwYB17ghOZuazu9GAYHuEKGOcYPCWgwr9w2mKwILdg14OuIsHGnCPP/mpdnu4tLQcAGe+L64TCe9bchHTedxuxNuBCNaXRnocMWr0svRYBtyY0I/hkUo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=NeJ7BdIp; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="NeJ7BdIp" Received: by 
mail-yw1-f201.google.com with SMTP id 00721157ae682-6feb1097d64so32446037b3.2 for ; Fri, 21 Mar 2025 23:35:46 -0700 (PDT) X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by
2002:a81:b245:0:b0:6fe:abd9:60f1 with SMTP id 00721157ae682-700babc118emr52707b3.1.1742625345211; Fri, 21 Mar 2025 23:35:45 -0700 (PDT) Date: Fri, 21 Mar 2025 23:34:00 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-33-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 32/35] perf vendor events: Update tigerlake metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Switch to metrics generated from the TMA spreadsheet. Minor threshold simplification. 
Signed-off-by: Ian Rogers --- .../arch/x86/tigerlake/tgl-metrics.json | 383 +++++++++--------- 1 file changed, 191 insertions(+), 192 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/to= ols/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json index 8c0cd6e63a2a..2db7a70f7a07 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json @@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", + "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -129,12 +129,12 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", "ScaleUnit": "100%" }, { "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", "MetricName": "tma_bottleneck_big_code", "MetricThreshold": "tma_bottleneck_big_code > 20" @@ -149,7 +149,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= 
l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + = tma_store_fwd_blk)))", "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", "MetricName": "tma_bottleneck_cache_memory_bandwidth", "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", @@ -157,7 +157,7 @@ }, { "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + 
tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound = + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l= 1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_b= lk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * = (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_boun= d + tma_store_bound)) * (tma_split_loads / (tma_4k_aliasing + tma_dtlb_load= + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + tma_split_l= 
oads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * = (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_stor= e_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tm= a_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing = + tma_split_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", "MetricName": "tma_bottleneck_cache_memory_latency", "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", @@ -165,22 +165,22 @@ }, { "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", "MetricGroup": "BvCB;Cor;tma_issueComp", "MetricName": "tma_bottleneck_compute_bound_est", "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " }, { "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_br= anches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tm= a_itlb_misses + tma_lcp + tma_ms_switches) 
+ tma_fetch_bandwidth * tma_ms /= (tma_dsb + tma_lsd + tma_mite + tma_ms))) - tma_bottleneck_big_code", "MetricGroup": "BvFB;Fed;FetchBW;Frontend", "MetricName": "tma_bottleneck_instruction_fetch_bw", "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" }, { "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_clears_resteers + tma_= mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_= dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + 
tma_ms_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_dsb + tma_lsd + tma_mite + tma_m= s)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mis= predicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / = tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bo= und * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0)= / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_= microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer)= * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", "MetricName": "tma_bottleneck_irregular_overhead", "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", @@ -188,7 +188,7 @@ }, { "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_latency_dependency + tma_lock_latency + = tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_dtlb_store / 
(tma_dtlb_store + tma_false_sharing + tma_spli= t_stores + tma_store_latency + tma_streaming_stores)))", "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", "MetricName": "tma_bottleneck_memory_data_tlbs", "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", @@ -196,15 +196,15 @@ }, { "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", "MetricName": "tma_bottleneck_memory_synchronization", "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" }, { "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", "MetricName": "tma_bottleneck_mispredictions", "MetricThreshold": "tma_bottleneck_mispredictions > 20", @@ -216,17 +216,17 @@ "MetricGroup": "BvOB;Cor;Offcore", "MetricName": "tma_bottleneck_other_bottlenecks", "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." 
}, { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", "MetricGroup": "BvUW;Ret", "MetricName": "tma_bottleneck_useful_work", "MetricThreshold": "tma_bottleneck_useful_work > 20" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -248,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", "ScaleUnit": "100%" }, { @@ -257,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", "ScaleUnit": "100%" }, { @@ -266,24 +266,24 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache.", "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_hit", - "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_hit > 0.05 & (tma_icache_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "BriefDescription": "This metric 
estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache.", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", "MetricName": "tma_code_l2_miss", - "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_l2_miss > 0.05 & (tma_icache_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -291,7 +291,7 @@ "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_hit", - "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & (tma_itlb_misses > = 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { @@ -299,33 +299,33 @@ "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", "MetricName": "tma_code_stlb_miss", - "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & (tma_itlb_misses >= 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses.", "MetricExpr": 
"tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_2m", - "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses.", "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", "MetricName": "tma_code_stlb_miss_4k", - "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & (tma_code_stlb_= miss > 0.05 & (tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_fron= tend_bound > 0.15)))", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((54 * tma_info_system_core_frequency - 5 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + 
OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (53 * tma_info_system_core_frequ= ency - 5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricExpr": "(49 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))= ) + 48 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS= ) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info= _thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. 
Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_= sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -335,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(53 * tma_info_system_core_frequency - 5 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", + "MetricExpr": "48 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT /= MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, = tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -362,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -372,7 +372,7 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, @@ -382,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", "ScaleUnit": "100%" }, { @@ -390,26 +390,26 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). 
Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_bottleneck_= memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { @@ -417,8 +417,8 @@ "MetricExpr": "54 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { @@ -437,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATEN= CY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, t= ma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_fr= ontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -447,7 +447,7 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", "ScaleUnit": "100%" }, { @@ -465,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", "ScaleUnit": "100%" }, { @@ -474,15 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active.", "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_fp_divider", - "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_fp_divider > 0.2 & (tma_divider > 0.2 & (t= ma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -490,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU 
has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -499,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -508,7 +508,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -517,7 +517,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -526,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -538,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.([ICL+] Note this may overcoun= t due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -556,8 +556,8 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", "ScaleUnit": "100%" }, { @@ -569,28 +569,28 @@ "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate).", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", 
"MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -619,7 +619,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite + tma_ms)))", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", @@ -628,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -637,10 +637,11 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + 
tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -701,11 +702,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -718,20 +719,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -773,7 +774,7 @@ "MetricName": "tma_info_frontend_tbpc" }, { - "BriefDescription": "Branch instructions per taken branch", + "BriefDescription": "Branch instructions per taken branch.", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -791,7 +792,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": 
"Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -799,7 +800,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -807,7 +808,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." 
}, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -815,7 +816,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -823,7 +824,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -831,7 +832,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -886,7 +887,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", + "MetricThreshold": "tma_info_inst_mix_iptb < 11", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1011,7 +1012,7 @@ }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1074,7 +1075,7 @@ }, { "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1101,12 +1102,12 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "MetricThreshold": 
"tma_info_pipeline_ipassist < 100e3", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, @@ -1147,14 +1148,13 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_system_kernel_cpi" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1195,7 +1195,7 @@ "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." 
}, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1203,7 +1203,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1211,7 +1211,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." 
}, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1239,7 +1239,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1248,15 +1248,14 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi", - "ScaleUnit": "1per_instr" + "MetricName": "tma_info_thread_cpi" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1266,13 +1265,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "slots", + "MetricExpr": "TOPDOWN.SLOTS", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1288,14 +1287,14 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + "MetricThreshold": "tma_info_thread_uptb < 7.5" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active.", "MetricExpr": "tma_divider - tma_fp_divider", "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", "MetricName": "tma_int_divider", - "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_int_divider > 0.2 & (tma_divider > 0.2 & (= tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { @@ -1303,8 +1302,8 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", - "PublicDescription": "This metric represents 
fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { @@ -1312,7 +1311,7 @@ "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, @@ -1321,7 +1320,7 @@ "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", "MetricName": "tma_l1_latency_dependency", - "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & (tma_l1_boun= d > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, @@ -1331,7 +1330,7 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1340,7 +1339,7 @@ "MetricExpr": "5 * tma_info_system_core_frequency * MEM_LOAD_RETIR= ED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / = tma_info_thread_clks", "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", "MetricName": "tma_l2_hit_latency", - "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & (tma_l2_bound > 0.= 05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, @@ -1350,17 +1349,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "(22.5 * tma_info_system_core_frequency - 5 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", + "MetricExpr": "17.5 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_bottleneck_cache_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { @@ -1368,18 +1367,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. 
While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", "ScaleUnit": "100%" }, { @@ -1396,7 +1395,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { @@ -1404,31 +1403,31 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_1g", - 
"MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_2m", - "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses.", "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", "MetricGroup": 
"MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", "MetricName": "tma_load_stlb_miss_4k", - "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & (tma_load_stlb_= miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_boun= d > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1437,7 +1436,7 @@ "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1447,7 +1446,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", "ScaleUnit": "100%" }, { @@ -1457,15 +1456,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, @@ -1474,7 +1473,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, @@ -1485,11 +1484,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1511,7 +1510,7 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, @@ -1526,24 +1525,24 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", + "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. 
Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", - "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details.", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D1@ / tma_info_core_core= _clks / 2", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", "MetricName": "tma_ms", "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", @@ -1554,7 +1553,7 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, @@ -1563,7 +1562,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1578,19 +1577,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", + "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { @@ -1634,8 +1633,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= 
CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { @@ -1643,8 +1642,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { @@ -1652,7 +1651,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1661,7 +1660,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1670,14 +1669,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1690,7 +1689,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1699,7 +1698,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1709,7 +1708,7 @@ "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", "MetricThreshold": "tma_split_loads > 0.3", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { @@ -1718,8 +1717,8 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { @@ -1727,7 +1726,7 @@ "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, @@ -1736,8 +1735,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. 
This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { @@ -1746,8 +1745,8 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { @@ -1755,8 +1754,8 @@ "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1773,7 +1772,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { @@ -1781,31 +1780,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_1g", - "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + 
"MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_2m", - "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses.", "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", "MetricName": "tma_store_stlb_miss_4k", - "MetricThreshold": "tma_store_stlb_miss_4k > 
0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & (tma_store_stl= b_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memo= ry_bound > 0.2 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { @@ -1813,7 +1812,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1822,7 +1821,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. 
These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1831,8 +1830,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.",
         "ScaleUnit": "100%"
     },
     {
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:34:01 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-34-irogers@google.com>
References: <20250322063403.364981-1-irogers@google.com>
Subject: [PATCH v1 33/35] perf vendor events: Update westmereep-dp events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin,
 Alexandre Torgue, Caleb Biggers, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Thomas Falcon

Update event topic moving other topic events to cache and virtual
memory.

Signed-off-by: Ian Rogers
---
 .../arch/x86/westmereep-dp/cache.json         | 32 +++++++++++++++
 .../arch/x86/westmereep-dp/other.json         | 40 -------------------
 .../x86/westmereep-dp/virtual-memory.json     |  8 ++++
 3 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json b/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json
index 30845c7dbf08..f6f95f3ff301 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-dp/cache.json
@@ -119,6 +119,38 @@
         "SampleAfterValue": "100000",
         "UMask": "0x2"
     },
+    {
+        "BriefDescription": "L1I instruction fetch stall cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.CYCLES_STALLED",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x4"
+    },
+    {
+        "BriefDescription": "L1I instruction fetch hits",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.HITS",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "L1I instruction fetch misses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.MISSES",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x2"
+    },
+    {
+        "BriefDescription": "L1I Instruction fetches",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.READS",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x3"
+    },
     {
         "BriefDescription": "All L2 data requests",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/westmereep-dp/other.json b/tools/perf/pmu-events/arch/x86/westmereep-dp/other.json
index bcf5bcf637c0..c0cf8bae8074 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-dp/other.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-dp/other.json
@@ -15,46 +15,6 @@
         "SampleAfterValue": "2000000",
         "UMask": "0x1"
     },
-    {
-        "BriefDescription": "L1I instruction fetch stall cycles",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.CYCLES_STALLED",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x4"
-    },
-    {
-        "BriefDescription": "L1I instruction fetch hits",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.HITS",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x1"
-    },
-    {
-        "BriefDescription": "L1I instruction fetch misses",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.MISSES",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x2"
-    },
-    {
-        "BriefDescription": "L1I Instruction fetches",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.READS",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x3"
-    },
-    {
-        "BriefDescription": "Large ITLB hit",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x82",
-        "EventName": "LARGE_ITLB.HIT",
-        "SampleAfterValue": "200000",
-        "UMask": "0x1"
-    },
     {
         "BriefDescription": "Loads that partially overlap an earlier store",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json b/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json
index 53d7f76325a3..84c920637b12 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-dp/virtual-memory.json
@@ -152,6 +152,14 @@
         "SampleAfterValue": "200000",
         "UMask": "0x20"
     },
+    {
+        "BriefDescription": "Large ITLB hit",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x82",
+        "EventName": "LARGE_ITLB.HIT",
+        "SampleAfterValue": "200000",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Retired loads that miss the DTLB (Precise Event)",
         "Counter": "0,1,2,3",
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Date: Fri, 21 Mar 2025 23:34:02 -0700
In-Reply-To: <20250322063403.364981-1-irogers@google.com>
Message-Id: <20250322063403.364981-35-irogers@google.com>
References: <20250322063403.364981-1-irogers@google.com>
Subject: [PATCH v1 34/35] perf vendor events: Update westmereep-dp events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Maxime Coquelin,
 Alexandre Torgue, Caleb Biggers, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Thomas Falcon

Update event topic moving other topic events to cache and virtual
memory.

Signed-off-by: Ian Rogers
---
 .../arch/x86/westmereep-sp/cache.json         | 32 +++++++++++++++
 .../arch/x86/westmereep-sp/other.json         | 40 -------------------
 .../x86/westmereep-sp/virtual-memory.json     |  8 ++++
 3 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/westmereep-sp/cache.json b/tools/perf/pmu-events/arch/x86/westmereep-sp/cache.json
index 90cb367f5798..0cd571472dca 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-sp/cache.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-sp/cache.json
@@ -119,6 +119,38 @@
         "SampleAfterValue": "100000",
         "UMask": "0x2"
     },
+    {
+        "BriefDescription": "L1I instruction fetch stall cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.CYCLES_STALLED",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x4"
+    },
+    {
+        "BriefDescription": "L1I instruction fetch hits",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.HITS",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "L1I instruction fetch misses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.MISSES",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x2"
+    },
+    {
+        "BriefDescription": "L1I Instruction fetches",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "L1I.READS",
+        "SampleAfterValue": "2000000",
+        "UMask": "0x3"
+    },
     {
         "BriefDescription": "All L2 data requests",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/westmereep-sp/other.json b/tools/perf/pmu-events/arch/x86/westmereep-sp/other.json
index bcf5bcf637c0..c0cf8bae8074 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-sp/other.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-sp/other.json
@@ -15,46 +15,6 @@
         "SampleAfterValue": "2000000",
         "UMask": "0x1"
     },
-    {
-        "BriefDescription": "L1I instruction fetch stall cycles",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.CYCLES_STALLED",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x4"
-    },
-    {
-        "BriefDescription": "L1I instruction fetch hits",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.HITS",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x1"
-    },
-    {
-        "BriefDescription": "L1I instruction fetch misses",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.MISSES",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x2"
-    },
-    {
-        "BriefDescription": "L1I Instruction fetches",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "L1I.READS",
-        "SampleAfterValue": "2000000",
-        "UMask": "0x3"
-    },
-    {
-        "BriefDescription": "Large ITLB hit",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x82",
-        "EventName": "LARGE_ITLB.HIT",
-        "SampleAfterValue": "200000",
-        "UMask": "0x1"
-    },
     {
         "BriefDescription": "Loads that partially overlap an earlier store",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/westmereep-sp/virtual-memory.json b/tools/perf/pmu-events/arch/x86/westmereep-sp/virtual-memory.json
index e7affdf7f41b..a1b22c82a9bf 100644
--- a/tools/perf/pmu-events/arch/x86/westmereep-sp/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/westmereep-sp/virtual-memory.json
@@ -128,6 +128,14 @@
         "SampleAfterValue": "200000",
         "UMask": "0x20"
     },
+    {
+        "BriefDescription": "Large ITLB hit",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x82",
+        "EventName": "LARGE_ITLB.HIT",
+        "SampleAfterValue": "200000",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Retired loads that miss the DTLB (Precise Event)",
         "Counter": "0,1,2,3",
-- 
2.49.0.395.g12beb8f557-goog

From nobody Thu Dec 18 13:40:49 2025
Y1M9flUlCoSYPLjmshuLAToLGV97N78N7gOUWxp7ntWoB0tI1WO8UYJSbDnYTOVV/rz1 521EJEDbf7abujjCTmP8SSHpNDsJNVEMt4VWVmsmNsxPMNLvGqou45Cx558ODMZToM4z E5XA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742625353; x=1743230153; h=to:from:subject:references:mime-version:message-id:in-reply-to:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XET9HH8V4TfXAFNsd7WcnaOZ0oJ/0MNCJs/F3W4WSf4=; b=AROigSAJXuXFUmPO4YDrv2XrRBbYTAYr1AuoL8iEdvyHMLBfhD19W64AJhQvKH/V/p PytENbMFaL0KNdA+ddW2fCF5J7BAu7klQ1DLuF2I7Q5NuiB/OL6ALWaPM3TOrS21saFa 4yYjUPOsFEQOP9sZydEz1xyed3PStz7nhBM8YEdbEohf4DVa/Ew6cpjMGbb/TgR76prU RuaZxkznKu1MfuB/Go9QyT3ZGCMN0pwj0A5PE1IxJYDX4iDaQEdCUztrQ478o5N+fGWE qPRgIflTFAWs/3lfREZl9QwlGy0/VXw96jqzJgCt3zOOwJUGx6r+M+Y33oaaij/XwV6s Sl6w== X-Forwarded-Encrypted: i=1; AJvYcCX/5h4Zy4M3hxzuz3ZWPmIqOsMitJjbM84xxsXuOMXiqerzJxSKrcRxqIawPbefoSifUgc0vGvw8ARpD1o=@vger.kernel.org X-Gm-Message-State: AOJu0YxMFvXJOKiP6Z9H2npOQoJ+W6AB501NhKvTqr8VM+N45g6btCYJ x6Ub1MxZPIuJztNa99dee3Vge9MhXr5W5d71LV2HdD+2MLyerSYhz8YRLtNwJoUjYfe8g1lqkUP i955UzA== X-Google-Smtp-Source: AGHT+IGE6LS58FWTpcecky7/duYOa1d0qwi4FZCOOefSBHXHy8h+SSHpUy73+FcIPAjAWWsHQb5Fobqeqm8T X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:c16d:a1c1:1823:1d0e]) (user=irogers job=sendgmr) by 2002:a25:9090:0:b0:e64:3d36:bea5 with SMTP id 3f1490d57ef6-e66a4ff61efmr5909276.9.1742625352664; Fri, 21 Mar 2025 23:35:52 -0700 (PDT) Date: Fri, 21 Mar 2025 23:34:03 -0700 In-Reply-To: <20250322063403.364981-1-irogers@google.com> Message-Id: <20250322063403.364981-36-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250322063403.364981-1-irogers@google.com> X-Mailer: git-send-email 2.49.0.395.g12beb8f557-goog Subject: [PATCH v1 35/35] perf vendor events: Update westmereep-dp events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung 
Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Maxime Coquelin , Alexandre Torgue , Caleb Biggers , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Thomas Falcon Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update event topic moving other topic events to cache and virtual memory. Signed-off-by: Ian Rogers --- .../pmu-events/arch/x86/westmereex/cache.json | 32 +++++++++++++++ .../pmu-events/arch/x86/westmereex/other.json | 40 ------------------- .../arch/x86/westmereex/virtual-memory.json | 8 ++++ 3 files changed, 40 insertions(+), 40 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/westmereex/cache.json b/tools/p= erf/pmu-events/arch/x86/westmereex/cache.json index 9f922370ee8b..2a677d10f688 100644 --- a/tools/perf/pmu-events/arch/x86/westmereex/cache.json +++ b/tools/perf/pmu-events/arch/x86/westmereex/cache.json @@ -119,6 +119,38 @@ "SampleAfterValue": "100000", "UMask": "0x2" }, + { + "BriefDescription": "L1I instruction fetch stall cycles", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.CYCLES_STALLED", + "SampleAfterValue": "2000000", + "UMask": "0x4" + }, + { + "BriefDescription": "L1I instruction fetch hits", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.HITS", + "SampleAfterValue": "2000000", + "UMask": "0x1" + }, + { + "BriefDescription": "L1I instruction fetch misses", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.MISSES", + "SampleAfterValue": "2000000", + "UMask": "0x2" + }, + { + "BriefDescription": "L1I Instruction fetches", + "Counter": "0,1,2,3", + "EventCode": "0x80", + "EventName": "L1I.READS", + "SampleAfterValue": "2000000", + "UMask": "0x3" + }, { "BriefDescription": "All L2 data requests", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/westmereex/other.json 
b/tools/p= erf/pmu-events/arch/x86/westmereex/other.json index bcf5bcf637c0..c0cf8bae8074 100644 --- a/tools/perf/pmu-events/arch/x86/westmereex/other.json +++ b/tools/perf/pmu-events/arch/x86/westmereex/other.json @@ -15,46 +15,6 @@ "SampleAfterValue": "2000000", "UMask": "0x1" }, - { - "BriefDescription": "L1I instruction fetch stall cycles", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.CYCLES_STALLED", - "SampleAfterValue": "2000000", - "UMask": "0x4" - }, - { - "BriefDescription": "L1I instruction fetch hits", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.HITS", - "SampleAfterValue": "2000000", - "UMask": "0x1" - }, - { - "BriefDescription": "L1I instruction fetch misses", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.MISSES", - "SampleAfterValue": "2000000", - "UMask": "0x2" - }, - { - "BriefDescription": "L1I Instruction fetches", - "Counter": "0,1,2,3", - "EventCode": "0x80", - "EventName": "L1I.READS", - "SampleAfterValue": "2000000", - "UMask": "0x3" - }, - { - "BriefDescription": "Large ITLB hit", - "Counter": "0,1,2,3", - "EventCode": "0x82", - "EventName": "LARGE_ITLB.HIT", - "SampleAfterValue": "200000", - "UMask": "0x1" - }, { "BriefDescription": "Loads that partially overlap an earlier store= ", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/westmereex/virtual-memory.json = b/tools/perf/pmu-events/arch/x86/westmereex/virtual-memory.json index 0c3501e6e5a3..1800c6ecbf80 100644 --- a/tools/perf/pmu-events/arch/x86/westmereex/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/westmereex/virtual-memory.json @@ -152,6 +152,14 @@ "SampleAfterValue": "200000", "UMask": "0x20" }, + { + "BriefDescription": "Large ITLB hit", + "Counter": "0,1,2,3", + "EventCode": "0x82", + "EventName": "LARGE_ITLB.HIT", + "SampleAfterValue": "200000", + "UMask": "0x1" + }, { "BriefDescription": "Retired loads that miss the DTLB (Precise Eve= nt)", "Counter": "0,1,2,3", --=20 
2.49.0.395.g12beb8f557-goog